So we are here to present our package, which hopefully will be completely done by the end of the year; right now we are still in the development phase, phase two. This package was developed by me, Lucio, and it took many years, many ideas, and many versions to come together, but we already have some published data that uses it. The main idea of the package is to interpret bulk RNA-seq results at the isoform level. So what does a generic transcriptome analysis for bulk RNA-seq look like? You get your FASTQ files from the facility, or from a run you did yourself. You do an alignment against the genome, or a pseudo-alignment against the transcriptome, and you do differential expression at the gene level. Then you do a functional enrichment, with GO terms for example, or a volcano plot to show off specific genes. This is the most generic bulk RNA-seq analysis, and it works very well at the gene level: if you want to compare the expression of a single gene between two conditions, it works very well. The biggest problem is that this analysis assumes only one thing is happening: your genomic region is transcribed into pre-mRNA, the pre-mRNA is spliced into a mature mRNA, and this mature mRNA is translated into a protein. And we know that is not exactly what happens. There is much more biological variability at the splicing step, and not just at splicing but also in where RNA polymerase II starts and stops transcribing different transcripts. So in reality, you have a genomic region; if you look at it at the gene level, that is all you are going to see. That genomic region, through splicing, can produce protein-coding transcripts — transcripts that have the start codon, the stop codon, and only exons — but it can also produce unproductive transcripts.
Here we call unproductive anything that cannot be translated by the canonical translation machinery. There are essentially two big unproductive groups. The processed transcripts lost the ORF in some way: it can be because of retained introns, or because of the loss of the start or stop codon. And the NMD transcripts have a premature stop codon near a splicing junction; these transcripts are marked by the nonsense-mediated decay mechanism to be degraded. The problem is that we have a very big range of transcripts, and when you analyze them only at the gene level, you're losing a lot of information. All right, but is there any meaningful impact of not analyzing at the transcript level for a clinician or for clinical results? There's a very good example — actually a textbook example — related to VEGFR-1, or FLT1, the VEGF receptor. Under normal conditions, what happens? VEGFA binds to the receptor, the receptor gets phosphorylated, and it activates two types of cascades: the PI3K/AKT cascade and the ERK cascade. These cascades activate proliferation, survival, and angiogenesis inside the cell. The problem is, this is not the only protein related to the FLT1 region — this is only the 201 transcript. FLT1 has many, many transcripts, six of them protein-coding. When you have an FLT1 isoform that is not the canonical one, that isoform doesn't have the membrane receptor part, so it sequesters VEGF before it reaches the membrane. In that cell, you're going to have less proliferation, less survival, and no angiogenesis. With this kind of data, if you did the analysis at the gene level, you would see a lot of VEGFR expression and you would ask, why is there no proliferation, no survival, and no angiogenesis? Simply because that isoform is not the isoform that carries the receptor.
So it doesn't produce the phenotype normally associated with the receptor. What we have today is that a lot of people are doing long reads on a lot of genomes, and we have a very big variety of isoforms of each type. There are also the long non-coding RNAs, which we are not going to touch — they are completely different from this. What we are going to touch today are all the isoforms that can be called unproductive: non-coding isoforms processed from coding genes. RNA polymerase transcribes a coding gene, but it becomes an unproductive isoform because of problems with start and end sites or problems with splicing. One actual example of published data is from the first paper where we used this package. What we did there was reanalyze public data of SARS-CoV-2-infected A549 cells, at the gene level and at the transcript level. This is all public data; the authors had analyzed it at the gene level only, and then we went and did a transcript-level expression analysis. If you follow this QR code, you can go to the paper. If you look at the analysis at the gene level, this is what you see for each of the cell types, infected or uninfected with SARS-CoV-2. But if you look at the transcript level, you see a very big diversity compared to the gene level, because there are multiple types of transcripts. And then you ask yourself, why does it matter? If, for example, you have a retained-intron transcript of a gene, that transcript is not going to carry out the canonical action of that protein, because of the retained intron — it's not even going to encode a protein in most cases. So when we did a term enrichment separating the productive isoforms from the unproductive isoforms, we found something very interesting.
We found that SARS-CoV-2 was upregulating processes on the unproductive side — processes related to antigen processing and class I MHC — which suggests the virus was inducing the unproductive isoforms that would help its infection of the cell. The productive isoforms were very focused on the cytokine and chemokine side. If you look here at specific genes, take HLA-B, part of the MHC class I complex. You see that the gene is a little bit upregulated, but the transcripts that are upregulated are all retained-intron, so this is not the canonical form of HLA-B. And this happens for almost all the transcripts associated with the MHC class I complex. So today we're going to present our workflow for producing this kind of figure, and try to interpret one specific dataset at the transcript level. What the workflow needs as input is your differential expression data and the latest annotation from GENCODE. Right now the workflow only works for mouse and human data, unfortunately, because to do this type of analysis you need a very well-annotated transcriptome; you can't do it on badly annotated transcriptomes. We decided to produce this input using Salmon, with Salmon's Gibbs sampling option, for pseudo-alignment, because we ran multiple tests and saw that Salmon plus Gibbs sampling gives consistent results across multiple runs of the same command — this is a correlation plot of Salmon versus Salmon comparing six independent libraries. Also, this Bayesian approach in Salmon produces technical replicates that are then used by swish to compute the differential expression. Swish is a package that's not much used in the field yet, but it comes from the group of Michael Love, and it's a new way to do this differential expression that we found works very well for very deep transcriptomes. Do we have anything in the chat, Moussa? Not yet, not yet, all right.
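A minimal sketch of the quantification-to-swish path the speakers describe, not the package's own code. It assumes Salmon was run per sample with inferential replicates (e.g. `salmon quant ... --numGibbsSamples 20`) and that `coldata` is a data frame you have already built with `names`, `files` (paths to each `quant.sf`), and a `condition` column:

```r
library(tximeta)   # imports Salmon quantifications with transcript metadata
library(fishpond)  # implements the swish method

se <- tximeta(coldata)          # transcript-level SummarizedExperiment
y <- scaleInfReps(se)           # scale the Gibbs inferential replicates
y <- labelKeep(y)               # flag transcripts with enough counts to test
y <- y[mcols(y)$keep, ]
y <- swish(y, x = "condition")  # nonparametric transcript-level DE
```

The per-transcript q-values and log2 fold changes then land in `mcols(y)`, which is the kind of table the workflow takes as input.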
So at the start, we take our GENCODE annotation and our TPM table and feed those into the workflow. From that we get, as a first output, a table that tells you which genes are not differentially expressed at the gene level while their transcripts are. These cases are very interesting because they may be cases of isoform switch. And what's an isoform switch? Imagine there's one isoform that's upregulated and another that's downregulated; when the program computes the gene level, it sums those TPMs and the changes cancel out at the gene level. So those specific transcripts are very interesting for that reason. We then do a visualization of the transcripts using GRanges, and a log2 fold-change plot of all transcripts of a gene. Also, one thing that was our collaborator's idea, and became one of the most important parts of the workflow: we reasoned in a very simple way that if a productive isoform affects a pathway, then the isoform that doesn't produce a protein is not affecting that pathway. So his idea was to enrich the isoforms separately — only the productive ones and only the unproductive ones — so you can see the pathways, GO terms for example, enriched separately. And this is something the package does as well. Here are some example figures straight from the package, without any further editing. This is the RNMT gene. You can see that if you applied a cutoff, the gene is not differentially expressed — it's slightly upregulated, but it doesn't pass any of the fold-change cutoffs. But if you look at the isoforms, there are two isoforms that are differentially expressed. This is a good question — Joy asked, is it fair to say transcript is a synonym for isoform? Yeah. That is something I have asked myself, because isoform is normally the nomenclature that people on the protein side use.
They talk about different protein isoforms and different protein biotypes. I somewhat prefer "differentially expressed transcripts", because it relates more to the fact that they are transcribed — they are the ones being transcribed. But you can also say isoforms; I would say it can be a synonym. You just ignore the protein part, because most of them are not going to produce proteins. Right now, I would say around 70% of all the annotated transcripts in the human genome are unproductive. So you can use both, but keep in mind that they are not producing proteins. Yeah, there is also a really good discussion around the topic — they can mean different things when you are talking about the actual molecules inside the cell. If you say one transcript is being produced in the nucleus, then you are talking specifically about one RNA molecule being expressed, being produced by the polymerase. But when you talk about the genome annotation — how we interpret what's in the DNA — then isoform is the best term to describe one genomic region that has the potential to produce one particular RNA molecule. Basically that. I chose this example for you because this is not a well-known gene, and it has four different proteins annotated on UniProt, corresponding exactly to the four productive isoforms that we see here. So they have different protein-coding isoforms, and these isoforms can be related to different phenotypic functions — like in FLT1, where one produces a soluble protein and another produces a protein with a receptor domain. Here, when you look at them on the genomic context plot, you see this is the canonical isoform, the 201, and compared to it, the 202 and the 203: the 202 has a downstream site that is included, and the 203 loses a huge part of the first exon. Then you see the 209, which loses a small part of the first exon.
So you can see that if these are translated, they would produce different proteins. And given that they have complete ORFs, they have the potential to be translated. Of course, we can't affirm that they are going to be translated without a specific dataset to address that. So that was my overall explanation. Before we pass to the hands-on part, just a little bit on the dataset we're going to use. The dataset is from this paper from 2020, and I chose it for two reasons. First, it's a patient dataset with paired cases and controls: pregnant women with and without preeclampsia, and only those with early-onset preeclampsia, so the phenotypes are very well determined. Second, this dataset is very deep — all libraries have over 15 million reads — and for this kind of transcript-level analysis you really need that depth, because otherwise the fold changes are going to be very small and not significant. So this dataset is very good; the paper didn't even try to go into the differential expression side of the data, but they produced a very good dataset. And I chose this one today because I know most of you have a clinical background, so we are working with direct patient data and not cell lines. So we can pass to the hands-on part, but first I'm going to stop sharing for a bit and see if there are any questions or specific problems with the terminology or concepts — what's splicing, what's an isoform, why do isoforms matter — because this needs to be clear before we go to the next steps. So, any questions? We hope we didn't scare anyone, because this is more molecular biology. Yes, this is very much molecular biology. Going deeper into the molecular biology can be tricky sometimes, and it's not what is done by default in this type of analysis.
So I hope we can at least give an overall explanation of why that would be important, and how you can go more deeply than the default kind of analysis. You might also ask, why isn't everybody doing transcript level, then? I would say: because we don't have deep datasets, and we don't have well-annotated transcriptomes to do it. So Joy asked: can the same RNA result in different transcripts because of random error, or is this important because lots of transcripts end up creating problems? That's a very good question. Our boss in Brazil says it's a way for the cell to subtly regulate itself. There are errors, but there are also conditions that favor not producing certain transcripts, because producing a protein is very costly for the cell. If the cell can regulate itself by producing an unproductive version of that protein, that is better than changing the entire regulation around the region just to stop transcribing it. Indeed, this is actually a frontier in the knowledge of biology — especially in eukaryotic genomes there are a lot of unanswered questions about why these things happen. Why wouldn't evolution select for a simpler process? Why should there be regulation after transcription rather than before? That kind of discussion is happening right now, and it actually changes a lot between organisms — some organisms prefer to do regulation before transcription, others after. But like Isabella said, the main thing is that if you take the protein as the final product, it's a really costly product for the cell, and regulation right before producing the protein is one of the most effective kinds. And you shouldn't assume that the RNAs present in the cell are there just because of errors, because every step has control, both inside the nucleus and after export from the nucleus to the cytoplasm.
The cell has machinery to control and mark those RNAs for degradation at different steps. So usually, when you find an RNA molecule that can be sequenced, especially in the cytoplasm, it's probably a molecule that has already gone through several layers of regulation and processing. Of course, you can never be sure, but you can be fairly certain that it's a mature RNA molecule that you identified in the cell, and that it's there for some reason — the reason could be just regulation, but it's there. Yeah. And there's a question: what sequencing depth is needed for this level of analysis? It's much debated, but I did extensive testing during my master's, and essentially, for the gene level you can get by with around a 10 million read library for a patient dataset; for the transcript level it's better to have at least 15 to 20 million per sample, because the more depth you have, the better you capture the isoforms that are not expressed as much. And for long non-coding or other low-abundance isoforms you need to go to 20 or 30 million, but we're not talking about those today. So, any other questions? You can also open your microphone to ask, if you don't mind. I guess not. Should we move on? So we move on. Do you want to take it from here? Yeah, we can resume from here. First I will talk about how to actually use the package, and then we're going to do an interactive demonstration of what should be expected. Let me share my screen. There we see the GitHub. Yeah. Let me see if I can share the screen; I'm just going to post the link in the chat. The Zoom window is separate from the others. Let me increase that a bit. So basically, like Isabella said, we are developing this R package based on analyses we were used to doing, to answer the question of how you can analyze differential expression data at the isoform level.
Our main goal is not to explore alternative splicing, not to say, oh, we found new isoforms, new splicing sites. It's basically trying to get a functional interpretation of why those isoforms are being expressed or not. Just interrupting a bit, Lucio — the problem with finding new isoforms is that all the packages that propose to do that get a very small number of hits, depending on the condition and the depth. And if you look at the annotation produced by long reads, you see a vastly different number of isoforms. So what we're trying to do here is use the annotation to interpret things. Thank you for sharing the link. And yeah, right now this package is on GitHub; it's in the development phase and it's going to be actively developed over the next few months. There is a bit of motivation written here. I don't know if you are familiar with how R package development works, but basically all the code for the project is in this repository. Here we have some example data — some CSV files from the dataset we are going to use; I will show how to load it directly using RStudio. All the code is here, and usually we write a vignette describing what we do. The demonstration we are going to do here is all documented in this vignette. Also, from the GitHub page there is a link to the rendered version of the repository, built with pkgdown. The website has basically the same information you can find on GitHub, but with a rendered version of the vignette. I will also show how to access that inside R. So the first thing, for installing the package, is to use the remotes package, which has a function to install directly from GitHub, for packages that are not hosted on CRAN or Bioconductor, for example.
Here I'm showing my RStudio session. Let me increase the font size a bit — I'm on a small Mac screen. Do you have any problem with dark themes? Because it's definitely going to be a dark one. Yeah, the first thing you need to do, if you don't have it installed, is use the remotes package, which lets you install packages from GitHub. So you usually run install.packages for remotes first, if you don't have it yet. Oops, it's just "remotes", right? No, wait — oh, I'm missing the quotes. Yeah, it's saying that I already have it and it's loaded. So using remotes, you can call install_github, and then you add the path: my username, luciorq, and the name of the package, isoformic. That will install from GitHub everything we need for the demonstration — and by everything, we mean the input dataset from the preeclampsia analysis, all the functions, and all the packages associated with it. I got some dependency errors. Oh, the problem is the fgsea dependency. OK, yeah, I think I haven't configured it by default to look on Bioconductor for that package. Some of the packages we depend on are not available on CRAN, so you need to install them from Bioconductor, and for that you also need the BiocManager package. Bioconductor is a repository of R packages used mostly on the life sciences side of R, and they are usually good-quality packages that go through a peer review process before being published. In the case of fgsea, for example, you do library BiocManager and then install. I'm going to post in the chat what you need to do: it's this, and then you can install isoformic. Yeah, let me try one thing — for example, if you do BiocManager install...
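Putting the installation steps from this part of the demo together, one possible sequence looks like this (the two Bioconductor dependencies named in the talk are fgsea and plotgardener; check the repository README for the current list):

```r
# Helper packages for installing from GitHub and Bioconductor
install.packages("remotes")
install.packages("BiocManager")

# Bioconductor dependencies mentioned in the demo
BiocManager::install(c("fgsea", "plotgardener"))

# The package itself, from the author's GitHub repository
remotes::install_github("luciorq/isoformic")
```

If you have never installed a Bioconductor package before, the first `BiocManager::install()` call can take a while, since it pulls in many base dependencies.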
I think if you try to install directly from GitHub using BiocManager, I'm not sure if it will also install the... I never tried. Yeah, I know it works directly — it looks on GitHub for the package — but I'm not sure if it will also find the Bioconductor dependencies. But yeah, basically this is the packaging side, so it's just installing. fgsea and plotgardener should install fast, I guess. But if someone doesn't have any Bioconductor packages installed yet, Bioconductor installs a lot of dependencies first, so it can take some time. Anyway, let's move on. For loading the package, we just use library isoformic. And all the steps Lucio is showing are in the vignette, so, Lucio, if you want to copy and paste from the vignette... Yeah, I'm just showing the initial setup. Oh yeah, and we are going to present this package at the Bioconductor conference. Yeah, at least a poster — no, it's live, it will be a demo also. It's a demo, but a smaller one. We are going to be at the Bioconductor conference, so if you're there, come say hi. Anyone interested in bioinformatics and using R for biology — it's a really nice conference, go for it. Okay, but yeah, let's... Oh yeah, you need to install plotgardener too; just use this. Oh wait, I didn't post it in the chat, sorry — I messaged someone directly, sorry. Here, just use this, Kinga, and it should work. Install those first, then install isoformic, and it should be fine. Yeah, the README talks about the Bioconductor dependencies, which we haven't automated yet. Wait, I forgot how to see the vignettes. I loaded the package... let's use the help. No vignettes found — that's a problem.
Probably it's not updated yet. Let me open the project. If you install from GitHub, I think it doesn't actually build the vignettes by default. But inside the installed folder you should have the vignette — inside the vignettes directory there is an R Markdown file, which is the vignette we are going to use. This is the same content that, like I said, is on the website. We are going to follow it code chunk by code chunk and try to explain the kind of input we need and how to adapt your data if it comes from a different kind of experiment — you can use different software to do the preprocessing, and we are going to show how to handle that. So the first thing, like I said, is loading the package. Is anyone still having problems with the installation, especially the Bioconductor part? Because we can definitely stop, and I can go through the chat for specific cases. Here in the documentation we discuss why we use Salmon, as Isabella already mentioned. Both Salmon and kallisto have a method where they use a Bayesian approach to estimate the uncertainty of how a read maps among the transcripts of the same genomic region — something genomic alignment methods like STAR and other genomic aligners usually don't give you. Traditional alignment methods try to separate transcripts just by the depth in specific exon regions. Exactly. And then, in the end, if your transcripts share the same exons, it's actually pretty hard to differentiate them, and you can't really estimate an expression level for specific transcripts.
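The "no vignettes found" problem above comes from the fact that GitHub installs skip vignette building by default. A hedged fix, assuming a standard remotes setup:

```r
# Reinstall, building the vignettes this time (slower, because the
# vignette code is executed during the build):
remotes::install_github("luciorq/isoformic", build_vignettes = TRUE)

# List the installed vignettes in the console, or open them in a browser
vignette(package = "isoformic")
browseVignettes("isoformic")
```

Alternatively, as the speakers do here, you can simply open the R Markdown file from the package's `vignettes/` directory, or read the rendered version on the pkgdown website.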
Actually you can, but several studies in the last few years have shown that Salmon and kallisto give a better estimation of the abundance of RNA molecules when using a quasi-mapping method. And the fishpond package is where the swish method is implemented, by Michael Love — remember that Michael Love is also an author of the DESeq2 package, and DESeq2 is, I think, the most used package for gene-level RNA-seq differential expression, together with edgeR. Michael Love's team also collaborated a lot with the Salmon authors to optimize the methods and to use the kind of information Salmon provides to get better transcript-level differential expression with swish. We haven't released this vignette on the website yet, but I'm working right now on exactly that — a vignette showing those initial steps — mostly because they depend on steps that are not based on R: most of the time, the preprocessing of RNA-seq data is done on Linux, on the command line. So for those steps we are going to provide documentation and more discussion, but we are not ready yet. And we are going to use the GENCODE annotation of the human genome. Basically, GENCODE is an extension of Ensembl — it uses Ensembl as a base — and it has a better evidence-based annotation of isoforms, including the functionality of the isoforms, meaning it annotates whether a transcript is known to have a retained intron, or is known to go into the NMD pathway for degradation. And that annotation is based on long-read sequencing, so it's easier to trust than when you call splicing events from short reads, because with long reads, when you get the full isoform, you can do this calling much more easily. Yeah, that's good to remember, I was going to say.
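To make the GENCODE biotype annotation concrete, here is a small hypothetical sketch of how you could tabulate the transcript biotypes the speakers rely on, straight from a GENCODE GTF. The file name is only an example; use the release matching your Salmon index:

```r
library(rtracklayer)  # Bioconductor package for reading GTF/GFF files

# Import the annotation and keep only the transcript records
gtf <- import("gencode.v43.annotation.gtf.gz")
tx <- gtf[gtf$type == "transcript"]

# GENCODE stores the functional biotype in the transcript_type attribute,
# e.g. protein_coding, retained_intron, nonsense_mediated_decay,
# processed_transcript
sort(table(tx$transcript_type), decreasing = TRUE)
```

It is exactly these `transcript_type` labels that let the workflow split transcripts into productive and unproductive groups.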
Yeah, there are actually several other projects that try to do functional annotation of the mammalian transcriptome, like the FANTOM project, and even Ensembl itself, and the same for other organisms. But we highly recommend using GENCODE if you just want a dataset where this information is easy to get, mostly because, especially when we get to the functional analysis that we're going to show later, if you use an annotation that has a lot of predicted genes, or genes without biological functional information, you can definitely continue the analysis, but on the functional part you lose a lot of interpretability. And you're also going to show the format the data should be in. Basically, you could use any organism, any annotation, but you would need to add another layer of information to be able to run the analysis. The good thing about human and mouse is that this annotation work is already done for us — and even comparing them, human is much better annotated than mouse. We decided to build this package largely because of the difference from the gene level: when you do a differential expression analysis at the gene level, you get something like 500 differentially expressed genes; at the transcript level, you get 3,000 differentially expressed transcripts. So you increase the amount of work it takes to interpret that data, and this package is precisely for that. Yeah, first I'm going to show where the data we are going to use is actually stored. For every package, you can make some datasets or files available together with the package installation, so I'm going to show where you can usually find that information. Here I'm using a function from the fs package, but there is also a function from base R that does the same, system.file.
Yeah, this is a function from the fs package where, if you give it the name of a package, it shows where in your system that package is installed. For example, I'm using macOS, so it's showing where my R is configured to install packages — and because this is my development version, it's showing the directory where I'm developing it, but when you install from GitHub it will show the directory where the package was installed. And when you provide an additional path, like in this example, it actually loads a CSV file from a directory inside the package. For example, if you print this object, we have a table that is just a CSV file with the expression data. Basically, like I said, we have some files here in the directory that contains the data. This is just a comma-separated values file — a table of the expression data — and with the read CSV function we can load it as a data frame. And that data frame — like I said, here we are starting from the step after you do the differential expression analysis, so here we have a table with the transcript IDs. Yeah, we are basically assuming that the user did a differential expression at the transcript level, and that it was done using the Ensembl transcript IDs, because by default, when you run kallisto or Salmon, you need to pass a reference for them to align against, the most used reference is GENCODE, and these are the transcript IDs you get from GENCODE. Oh yeah, that's good to remember: if you run Salmon or kallisto with the GENCODE annotation, you get that by default, and with swish or DESeq2 for the follow-up analysis, you get exactly that table. Here you will probably just need to change the column names.
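A sketch of the data-loading step being demonstrated, assuming the example files live under the conventional `extdata/` directory of the installed package (list the directory to see what is actually shipped — the exact file names are not specified here):

```r
# Base R equivalent of the fs approach shown on screen
extdata_dir <- system.file("extdata", package = "isoformic")
list.files(extdata_dir)  # see which CSV files the package ships

# Load one of the shipped tables as a data frame
example_file <- list.files(extdata_dir, pattern = "\\.csv$", full.names = TRUE)[1]
de_tx <- read.csv(example_file)
head(de_tx)  # expected columns include the Ensembl transcript IDs
```

`system.file()` returns an empty string if the path does not exist, which is a quick way to check whether the package (or the file) is actually installed.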
But if you have any differential analysis, you can provide the same information just by changing the names of the columns — for example, if you used another annotation, you could try to map the column names into this format, and then you'll be able to follow along. Here, as you can see, we have 72,000 transcript annotations, and that's a much higher number than when you work at the gene level — and this is already after cutting the rows that are all zeros. Right now, for human, there are around 200,000 annotated transcripts, so it's 72,000 out of 200,000 — just the ones whose expression can be detected. And then the other thing we need is the actual annotation. Let's go one by one. Yeah, I was going to show the files that we have in that directory. Here we already provide one dataset for the transcripts, and in this case one table for the differentially expressed genes — there are also ways of inferring the differentially expressed genes from the transcript level, depending on the package you use. And this is the actual expression table: the normalized counts for each sample. This is an experiment that had 12 samples — 12 samples after the outliers were removed — but it's not half-and-half case and control; it's seven cases, a bit more cases than controls. For the analysis, you need a table describing case and control — metadata about your samples. Usually, if you did a differential expression analysis, you would already have that table somewhere. It's basically just a data frame describing the samples; you can give a name to each sample, and change the name if you want, but that name needs to be the same name that is in the columns of the expression table. Just one more thing.
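A minimal sketch of the sample metadata table being described, for the 12-sample design above (seven cases, five controls). The object name `expression_table` and the column labels are illustrative; the one hard requirement the speakers state is that the sample names match the expression table's column names:

```r
# Build metadata whose sample_id values match the expression table columns
sample_metadata <- data.frame(
  sample_id = colnames(expression_table),
  condition = rep(c("case", "control"), times = c(7, 5))
)

# Guard against mismatched names before running the analysis
stopifnot(all(sample_metadata$sample_id %in% colnames(expression_table)))
```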
The reference file needs to be the same reference that you used to align. We are using GENCODE 33 here, but there's GENCODE 40 already. Oh, about other annotation tools: what Vitaly asked Lucio was whether we compared the results from our tool against other annotation tools on the same data. We didn't do our own annotation; what we did was use the GENCODE annotation. But I did use this with CHESS and other annotations as well. The problem is that they have fewer transcripts annotated, fewer types of transcripts annotated, so it's harder to interpret. So, Vitaly, we didn't do our own annotation; what we did was the visualization of those results, pretty much. Yeah, like Isabela was saying, we tried to compare with other annotations, and what we see is that the annotation you use definitely changes everything. When you summarize the expression at the gene level, if your annotation just contains one or two transcripts per gene, then your transcript-level expression is going to be almost the same as the gene-level one, so you can't really distinguish the contribution of individual transcripts to the gene-level expression. Here we are showing GENCODE, but I also tried the FANTOM annotation. FANTOM actually has a lot more transcripts because they incorporated CAGE-seq data. CAGE-seq is a methodology that identifies transcription start sites, and they call different transcription start sites different transcripts. But at least at the functional level, we don't have information on exactly how that different annotation impacts the results. So here we kept GENCODE because it's easier to interpret, but it can definitely be done with whatever annotation, and you can also include your own annotation.
There's a preprint out from ENCODE version four where they do a very interesting transcript classification based on types of transcripts. That's something we are thinking about incorporating. But GENCODE is the more direct one if you just want a result. Exactly, if you just want to enhance the analysis you've already done. Because it's a common problem that people do differential expression analysis thinking about protein levels. Most of the time, especially when we are thinking about diseases or clinical practice, in most bioinformatics and genomics approaches people go on to interpret things at the protein level: they say that a mutation is causing a protein to be different, or that the RNA-seq is going to tell you a protein is expressed at a different level. But actually, most of the data we have until now says that you can't really correlate those directly. And one way of trying to get a more precise correlation from the RNA expression data to the protein expression data is separating what part of that transcription really comes from protein-coding RNA molecules, and how non-coding isoforms could be impacting it; in other words, what's the contribution of the non-coding transcriptome. That's the kind of analysis we want to leverage here, to help interpret the actual phenotype, the biological or clinical outcome you can see in your data. Right now, most databases and most published work usually don't go that deep. And there is plenty of data already available from published studies that can be interpreted at another level. And every day a new annotation comes out. So yeah, I hope we discuss a bit about that. Here we also included in the package a function to help you download the references we need. So this download-reference function in the package is actually...
Yeah, we have documentation here. Right now it only downloads from GENCODE, but you can choose the version you want to download and where it's going to save the data. We offer these three file formats: GTF, GFF, and FASTA. For the analysis we are going to describe here, you need both the FASTA and the GFF. Those files are usually around 50 or 60 megabytes. I already have them downloaded here, but if you execute both of those functions, it will start downloading on your machine and take a couple of minutes, normally two or three, depending on your internet connection. It saves into a data directory inside the directory you are working in. You can change the path using the output path argument: you just pass a directory for it to save to, but by default it will create that data directory and save inside it. And the other functions we are going to use will, by default, also look inside that directory, so if you change it, you need to change it in the following steps too. We are working on automating that step using tximeta and other packages to automatically identify the annotation, but it's not ready yet. Besides, if you used Salmon to get the transcript expression, you will already have those files, because you need them to build the index and do the alignment. If you look at the downloaded file directly it's going to be a mess, remembering that it's a compressed GZ file, but the actual FASTA and GFF files inside are in text format. So here I'm just showing what the first lines of that format look like. Because, like I said, if you really want to use this with another organism, or with an annotation that is not in this format, you would need to transform those files to at least generate the same format.
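A hedged sketch of what a download-reference helper like this does: fetch the GENCODE transcript FASTA and annotation for a chosen release into a local data directory. The function name and the `outdir` default are hypothetical, and the URL layout below matches the GENCODE FTP mirror at the time of writing; verify it before relying on it:

```r
# Download the GENCODE transcripts FASTA and GFF3 for one human release.
download_gencode <- function(version = 33, outdir = "data") {
  dir.create(outdir, showWarnings = FALSE)
  base <- sprintf(
    "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_%d",
    version
  )
  files <- c(
    sprintf("gencode.v%d.transcripts.fa.gz", version),
    sprintf("gencode.v%d.annotation.gff3.gz", version)
  )
  for (f in files) {
    # mode = "wb" so the gzipped files are not corrupted on Windows
    download.file(file.path(base, f), file.path(outdir, f), mode = "wb")
  }
}

# download_gencode(33)  # ~50-60 MB each; takes a couple of minutes
```

Keeping the downloads in one known directory is what lets the later steps find the reference files by default.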
For example, the GENCODE FASTA file always uses a header in that format: the transcript ID and the gene ID in the Ensembl format. It also has some other annotations that we are not going to use, but it also has the transcript name and the actual gene name following the HUGO project, the human gene nomenclature, and the length of the transcript. And this is one of the most important pieces of information we use in the following steps: GENCODE already gives us an annotation of the type of each transcript, and that's the information we use downstream. Isabela actually made a really good figure describing what those transcript types mean. Like I said, most of the time the most important ones are the protein-coding ones; usually that's what people want to study. But you also have, for example, pseudogenes. They can actually be transcribed, and they are sometimes known to have distinct biological functions. In human, the number of pseudogenes is relatively low, because most pseudogenes end up reclassified as processed transcripts or as protein coding. But other organisms, like mouse, have a lot of pseudogenes; in organisms where the annotation is not as good, you're going to see a lot of pseudogenes. If we could summarize, the main groups would be the protein coding, the processed transcripts, and everything categorized as NMD. Remember that this is how the Ensembl project defined those names, so depending on the annotation, those names can mean different things. And when we use that file, we also implemented a function to make the transcript-to-gene table. We use the term TX everywhere to refer to transcripts; it's a practice that came from Michael Love with tximport and tximeta.
TX is a common acronym for transcript. So we have this make-tx2gene-table function that takes the FASTA file and extracts a conversion table from transcript to gene, and also generates the metadata of the names and the types. Actually, this is an important moment to talk about IDs. You're going to see that we try to use the actual transcript ID, instead of the transcript name or the gene name, throughout the analysis. Sometimes you find transcripts that, in different GENCODE versions, are not assigned to the same gene, or genes that don't have an official name yet, things like that. So if you convert directly to the gene name or the transcript name, the HUGO names, before doing the analysis, you end up biased: in the end you don't really know which genomic region, which genomic locus, you're talking about. It's a pretty common mistake. So let's just show the tx2gene table. Basically, this function extracts the relationships from the FASTA headers; for example, the first two lines here are the DDX11L1 gene, and we also have microRNAs, long non-coding RNAs, and so on. Oh, I did a head there; let me open the full table. Whatever gene you have in the human genome, you'll find it here, with all the transcripts related to it. For some of the cases they don't have some of the annotations, but the only unique identifier for them is the transcript ID, and the gene ID for the genes. And the part we are interested in, like we said, is the transcript type. Like I said, it comes from how Ensembl and GENCODE define them.
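The extraction being described can be sketched in a few lines of base R. GENCODE transcript FASTA headers are pipe-separated, and the field order used below follows GENCODE's documented layout (transcript ID, gene ID, two Havana IDs, transcript name, gene name, length, transcript type); check it against your release before reusing this:

```r
# One header in the GENCODE transcripts-FASTA style (DDX11L1 example).
hdr <- ">ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|"

# Drop the leading ">" and split on the pipe separators.
fields <- strsplit(sub("^>", "", hdr), "|", fixed = TRUE)[[1]]

# Keep the columns the workflow uses: IDs, names, and the transcript type.
tx2gene <- data.frame(
  transcript_id   = fields[1],
  gene_id         = fields[2],
  transcript_name = fields[5],
  gene_name       = fields[6],
  transcript_type = fields[8]
)
tx2gene
```

Applied over every header in the file, this yields the full transcript-to-gene conversion table; note the Ensembl IDs, not the HUGO names, are what uniquely identify each row.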
After this, we will discuss how to use other annotations. But here you have, at the transcript level (remember, this is a transcript-level annotation), whether it's a protein-coding transcript, whether it's known to go through nonsense-mediated decay, whether it's a retained intron, a long non-coding RNA, a microRNA. Some people asked me in a private message how trustworthy this is, whether this really is a retained intron. It's very hard to say. What we know is that in the long reads annotated by Ensembl, these appear as retained introns. Lucio and I have found small errors in the GENCODE annotation before, but we always report them, and they always get fixed in the next GENCODE release; it's a community effort to improve the transcriptome annotation. All that effort is trying to shed some light on what could be going on in the cell; it's not going to be the definitive answer for anything, and you always need more evidence after these steps. And like I said, if you have a different annotation for those transcripts, you just need to change the table to include other genes or other regions of your interest; it can be adapted into the workflow. But GENCODE is the easiest to interpret right now. I'm not saying it is the best, or that it's going to save humankind, but it's the one we can work with in a streamlined way, without changing or integrating annotations. Well, we actually do work with further annotations; we try to integrate annotations when we don't have information about specific genes. But yeah, that's the approach. Let's show the filtered tx2gene in a View, just so we can see it. Remember that table has all of the roughly two hundred thousand transcripts. Also, I don't know if you use RStudio, but there's a really good function for showing data: the View function.
It's View with a capital V. It will open a data table in a spreadsheet-like format, like an Excel table, that you can explore. So, for example, if I want to find the FLT1 gene... here we have several transcripts for FLT1, and all of them are annotated as protein-coding transcripts. But if you look at the actual proteins they generate, they are really different proteins, and we know that for FLT1 the different proteins have different phenotypes associated with them; some are exported outside the cell. The ATF3 transcription factor is another good example of a gene with multiple annotated protein-coding transcripts. We know it has a longer and a shorter isoform. It's a transcription factor, so it needs to go to the nucleus to have its transcription factor activity, but one of those isoforms is known to lack the subunit that interacts with the DNA, and it actually regulates the pathway negatively, even being a protein-coding transcript. So that's ATF3. And with that table you can already explore some of the types of transcripts we have in the dataset. Then here we are going to work with the files that are already in the data. Here we just prepare the differentially expressed genes table. Like I said, we already have the p-value and log fold change information for all of those genes. And basically, if you have another table with that information, you just have to change the column names.
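The same lookup done in `View()` can be done programmatically, which is handy outside RStudio; the two-row table here is a stand-in for the full tx2gene table:

```r
# Miniature stand-in for the tx2gene table built earlier.
tx2gene <- data.frame(
  transcript_id   = c("ENST-A", "ENST-B"),
  gene_name       = c("FLT1", "ATF3"),
  transcript_type = c("protein_coding", "protein_coding")
)

# Filter to one gene's transcripts, as done interactively in the demo.
subset(tx2gene, gene_name == "FLT1")

# View(tx2gene)  # interactive spreadsheet viewer; RStudio only
```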
And here we are also using the normalized counts for each of those transcripts, so we can generate some plots of the actual transcripts in relation to the gene. That table is actually the most important one, because it is the transcript-level expression table. And this is already transformed to TPM, the normalized expression. There's a good discussion about how differential expression analysis should use exactly the raw counts, but here we are moving beyond the differential expression analysis, and for plotting and visualizing, TPM (transcripts per million) is the best normalization to compare different libraries, different samples, not just the same gene across samples. Each step is really well documented. And we use the transcript-to-gene annotation table we generated to add the columns we need to the expression data; what we're doing here is constructing the final tables with the dictionary information from GENCODE. Exactly. Like I said, for these steps we haven't tried to create a function in the package that does everything automatically, because these steps are just data wrangling, transformations of the data. You'll see that here we use the pipes and the tidyverse approach extensively: a lot of select, a lot of left_join. By the way, left_join is the best function in R. So basically, our approach is keeping things in separate tables and merging them when needed. We have a table of the expression, a table of the differential analysis, and a table of the annotation, and we merge them when they are needed.
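Since TPM does the heavy lifting for the plots, here is the conversion from raw counts, with toy numbers rather than the real dataset:

```r
# Counts to TPM: normalize by transcript length, then scale each library
# so its values sum to one million.
counts <- c(tx1 = 100, tx2 = 500, tx3 = 50)   # raw counts per transcript
len_kb <- c(tx1 = 1.5, tx2 = 2.0, tx3 = 0.5)  # effective lengths in kilobases

rate <- counts / len_kb          # length-normalized read rate
tpm  <- rate / sum(rate) * 1e6   # scale so the TPMs sum to one million
tpm
```

Because every sample is rescaled to the same total, TPMs are comparable across libraries, which is exactly why the raw counts stay reserved for the differential testing and TPM is used for visualization.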
We had that discussion in the past, about whether we should fuse everything into one single object so people just press one button and run the analysis, but we also think this step is crucial for you to understand the data: to really understand which annotations you have, that you can change that annotation, and what is being used. We are a bit old school, and Lucio was the one who taught me, so I feel a little bad that now we all use packages built on S4 objects with a bunch of tables compacted inside, because mostly you don't even know what's actually in there. So yeah, I like to have separate tables for things. Actually, this is the opposite of the approach the Bioconductor project aims for; Bioconductor always tries to have one data class, one object, that contains all the data. We are thinking of optimizing our workflow to integrate all that, but we want to keep the flexibility for people who already have tables. Because it's really common in bioinformatics for people to have a lot of Excel spreadsheets on their computers, we also want people to be able to create the objects from loose spreadsheets, and of course to implement all the validation steps so your tables actually carry consistent information. So basically, we are thinking about whether SummarizedExperiment, or that kind of data class, can hold this kind of data better. So here we are just filtering the genes and transcripts that we call differentially expressed. We are using cutoffs on the p-value and the fold change: 0.05 for the p-value, and an absolute log2 fold change of at least 1. This is basically for you to determine what you consider a differentially expressed gene and a differentially expressed transcript; you can change these values to be more or less stringent.
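The cutoffs just quoted translate directly into a dplyr filter; the column names (`pvalue`, `log2FC`) are illustrative, not necessarily the package's own:

```r
library(dplyr)

# Toy differential expression results at the transcript level.
de <- data.frame(
  transcript_id = c("tx1", "tx2", "tx3"),
  pvalue        = c(0.001, 0.20, 0.04),
  log2FC        = c(2.3, 1.8, 0.2)
)

# Keep transcripts passing both cutoffs: p < 0.05 and |log2FC| >= 1.
de_transcripts <- de %>%
  filter(pvalue < 0.05, abs(log2FC) >= 1)
# tx1 passes both; tx2 fails the p-value; tx3 fails the fold change
```

Loosening or tightening the two thresholds is just a matter of editing the `filter()` call, which is the flexibility being described.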
But I would recommend being less stringent right now, so you can be more stringent at the end; that way you have more to work with. Yes, exactly. And then, like I said, here we are just working with the tables. Here we added back to the transcript table the information about which genes were differentially expressed, so we keep that information; it's going to be useful later. This is basically the final table we're going to use as input for almost everything: the transcript expression, the gene information, and whether that gene is differentially expressed or not. Yeah, I just saw the question coming in. No, definitely there are advantages to S4 objects, and like I said, we created this workflow based on the kind of output people actually generate. We first debated whether we were going to use the tximport output format or SummarizedExperiment, but then we saw that each package generated a different variation of the SummarizedExperiment object, and it was actually harder for us to keep everything tracked; most of the time people just have some tables they generated, like a collaborator sending me a table of differentially expressed genes. So we started from that, but we really aim to be compliant, let's say, with the Bioconductor project; a future version will have a function to convert from one to the other. That's the next step. And then there's one more clear advantage.
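The "add the gene-level result back to the transcript table" step described here is one mutate over a join key; all names below are illustrative:

```r
library(dplyr)

# Transcript table with its parent gene IDs.
tx_table <- data.frame(
  transcript_id = c("tx1", "tx2"),
  gene_id       = c("g1", "g2")
)

# Genes that passed the gene-level differential expression cutoffs.
de_genes <- data.frame(gene_id = "g1")

# Flag each transcript by whether its gene was called DE.
tx_final <- tx_table %>%
  mutate(gene_DE = gene_id %in% de_genes$gene_id)
# tx1 belongs to a DE gene (TRUE); tx2 does not (FALSE)
```

This flagged table is the kind of single input the later plotting functions consume.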
The benefit of using S4 objects is that you can do the validation, checking that all the data is in the right format, in the instantiation of the class itself. That is definitely the biggest advantage, and the one we want, because right now we are assuming that you prepared the tables in the right format; we actually don't do many checks around it. I think, Lucio, we can execute the next step and then take a ten-minute break before plotting the images, because it's already 12:15, 12:16, and this is almost half of the workflow. So right now we have the two most important tables: the table with the gene-level information and the table with the transcript-level information. If anyone is finding it hard to follow until now, it's basically because right now we are just transforming and joining tables, so that we end up with big tables containing all the information: whether a gene is differentially expressed at the gene level, and also at the transcript level. In the next step we are going to start plotting that information, so we can visualize it and extract functional information from it. So if you have any gene you like, start thinking about it, because we can try it in the next step; we are going to generate interactive figures. And we are going to take a break of ten or fifteen minutes. I think ten is fine; we come back at 12:30. We come back at 12:30, just so everybody knows. I'm going to stay here, so if you have any specific questions you can ask me in the chat, just to give Lucio and everybody else a break. Alright. I'll just get some water, but we will be around, so if you have any questions, or need help installing R, or...
...any R problems, that kind of thing. We definitely want to discuss concepts, because we are molecular biologists, and the thing we like the most is discussing these things. Actually, I would be curious to ask about the audience: whether people have a more basic-analysis background or a more technical one, like at an R conference where people have a more technical background. I would be curious to see the demographics, how many clinicians we have. Oh, I can do polls; I just need to leave and come back, and Rachel will pass me the co-host role again so I can create the polls and we can find out. Because part of the work I have been doing, at least in the last year: I actually come from a biotechnology degree, and then I dove deep into bioinformatics during grad school. Since last year I have been working in a department of pathology, at Weill Cornell Medical School, and what we mostly do is talk with clinicians and explain why we do some things in certain ways on the molecular biology side, to help the clinicians in the department of pathology generate better biomarkers or define new categories of disease, things like that. This is work we are deeply interested in, and R is one of the most beautiful languages for dealing with data and helping with it. Right now I can do polls, so let's find out who our audience is. I'll just grab some water. Hey Rachel, I think I still don't have the permission to create a new poll. Oh no, let me see if I can make you host, if that works. No, I don't think it does; that's weird. Maybe I don't know what I'm doing either. If you chat me what you want, I can create the poll here. Alright, I'm going to chat it to you, thanks.
No, no, you can go for a coffee right now; we're going back in eight minutes, at 12:30. Everybody have a break. I'm back if you need anything. Rachel, can you pass me the co-host again? Oh no, it's fine, I have the entire kit of things here. I made the poll that you asked for, Lucio. And most of our audience are computer scientists, so I guess that's why we are discussing a lot of concepts. I'm going to be honest: for 90% of the questions you ask, my answer is going to be "it depends"; that's biology. So at least we want to make things that are really hard a bit less hard, and show how data wrangling and merging of information can help, how data science skills can help. Oh, that's good; I'm glad we did a good job explaining. It's just that we know this is not something everybody does; most of the people who analyze RNA-seq are never going to think at the transcript level, and that's fine. But especially if you are computer scientists, you know the power of public data, so if you want to do a re-analysis at the transcript level in some condition, I guarantee you that nobody did it before. And it's also a question of how the knowledge is evolving. Think about the last 20 years of genomics: the promise of the human genome project, and really the thread running through all of bioinformatics, was that we would know the genes, we would know the genome, and all the diseases would be solved and all the phenotypes understood. And actually, the more we explore and understand about genes and molecular biology, the more we discover that... yeah. There's a question; you can open your mic if you want, or type in the chat. Can you hear my mic? Awesome.
Yeah, so I think more and more people, especially at the biology level, are thinking transcript, because more and more people are analyzing their own data, whether it's single-cell seq with alevin-fry or bulk seq with Salmon, because the CPU requirements of STAR are just too high for most people doing a one-off. If you're doing lots of stuff, okay, you set up a little EC2 instance or something on the cluster and you're good to go, but I think more people are using Salmon now because it can run on a MacBook Pro. And of course the output from that is transcript quantification, so they always have to have a step to convert from transcript quantification to gene-level counts, and I think that's making a lot more biologists aware of the difference between a transcript and a gene count. So the number of people who know the difference has really expanded in the last three years, though I think many still don't know what they can do with those differences. Another little plug: you talked about using DESeq2, which is Mike Love's package. I am on one of Mike Love's other packages, Fishpond, and I wrote the splicing function for loadFry, for doing RNA velocity analysis from the Fishpond package. I have a very good story about that: I met Mike Love at the Cold Spring Harbor meeting, and I told him your package, Fishpond, literally saved my master's. Thank you, and we're really glad to have you here. And actually this is a good point, because when you move from gene-level expression to transcript-level expression, the main assumption in differential expression analysis like DESeq2 is that you have a negative binomial distribution of the read counts, and
actually for transcripts we don't really know that at all; we don't really have that distribution holding across all the transcripts of a gene, so we actually get a better estimate using a non-parametric approach like the one implemented in the swish method. Oh interesting, that's a great point, because there are multiple assumptions in a negative binomial model; one, of course, is that the vast majority of transcripts are not differentially expressed. That's a great point I hadn't even thought about. The reason I'm here is to learn your stuff; I have never done this on bulk before. One thing we noticed during my master's, working with long non-coding RNA isoforms, is that Fishpond captures the long non-coding RNA isoforms way better than the other differential expression packages, especially compared with edgeR and DESeq2. I tried all of them, and the swish function was the only one that actually gave me deeper results. We don't know if it's because of the biology, the way they are expressed, or something else, but it's very cool. We're obviously very proud of swish, so thank you so much for that; I'll pass the word along to Mike that people are really enjoying it. We actually met Mike this November and made sure to tell him; it feels like if Mike hadn't come to my poster, this package would never have existed, the motivation to build it. But I think we can go back. So now we are at the nicest part, at least for me: the best part of bioinformatics is plotting pretty figures. Remember that we are not really proposing any new statistical approach or new computational method here; we are mostly showing how to interpret data that, most of the time, is already there, and like our guest was just saying, a lot of people are already using it; the
information is already there, and when you summarize it back to the gene level you are going one step back, not one step beyond; you can get at much deeper biology if you stay at that level. So basically, with the table that has both the gene- and transcript-level differential expression, we can generate some plots. For example, we have a function implemented in the package that draws a basic bar plot showing whether each transcript is differentially expressed. Remember that that definition of differential expression doesn't come from our method; it depends on the method you used upstream. In this case we are showing data that was generated using swish. In that figure we always plot, in the first column, the summarized gene-level expression. For example, for this gene you can see one transcript that's upregulated and another that's downregulated; both are protein coding, but in the gene-level differential expression the gene would appear as not differentially expressed. So you can always have the discussion: one transcript goes up, the other goes down, so should that be considered zero net change, or should they be analyzed separately? There is no easy answer; you always have to go back to the biology or try a functional analysis. In this case, this gene, RBPJ, is a signaling component of the Notch pathway; the canonical transcript is the 201, the first one. We see that the canonical transcript is not upregulated, but two non-canonical transcripts are changing: one up and one down. So if the Notch pathway is important for the condition, in this case preeclampsia, it would be a very important result. Exactly; if you just interpret the expression at the gene level, it's not that you are doing something wrong, but you are losing biology that could lead to other discoveries. Most of the time, especially if
you concentrate on the protein-coding genes, you can find literature about specific isoforms. Even without comparing the genomic context here: if the isoforms have different transcription start sites, sometimes they generate totally different proteins. So even just at the protein level, if you also have experiments quantifying proteins, this approach can lead you to a better integration of that data: if this transcript generates a different protein from that one, you can correlate against the one that was actually upregulated. Here we are showing that the function can also plot more than one gene at the same time; the layout for multiple genes still needs fixing, but in the end the function returns a ggplot object that you can modify at will. For example, you could take that plot object and apply any ggplot2 code to it. I think I haven't loaded ggplot2 yet: library(ggplot2), and then add a theme. And like Joy commented about publication-quality transcript-level figures, the problem is convincing people that the transcript level is important; we have a whole series of arguments for that. The plot is still not perfect, but it's easy to manipulate with ggplot2, and we are using ggplot2 for almost everything, so in the end it's a very easy plot to work with. I have a question in the chat; there are no stupid questions, Joy. Let me see: are the reads coming from a specific tissue? Biologically, I'm guessing the isoforms differ across tissues; so, with regard to public data, if we wanted to practice what you're teaching, should we look for specific genes related to a specific tissue type? Yes. The problem is that finding data with the depth to do this analysis is not an easy thing, so
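Because the plotting functions return ggplot objects, restyling is ordinary ggplot2 code. Here `p` is a stand-in built from toy data, since the package's own plotting function isn't named in the session:

```r
library(ggplot2)

# Stand-in for the bar plot a package function would return.
df <- data.frame(
  transcript = c("tx-201", "tx-202"),
  log2FC     = c(0.1, 1.7)
)
p <- ggplot(df, aes(transcript, log2FC)) + geom_col()

# Any ggplot2 layers or themes can be added to the returned object:
p + theme_minimal() + labs(title = "Transcript-level fold changes")
```

This is the design choice being described: rather than a fixed image, the function hands you an object you can keep composing.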
you're completely right: there will be different isoforms in different tissues. I think the new ENCODE papers explore this a lot, if you want to look at the publications. As for how to find data to do that, I would go on SRA and check for high-depth bulk RNA sequencing in a condition you're interested in. The more established or pure your samples are, the better consistency you will probably find in which transcripts are expressed. For experiments with cell lines, or with animal models that have a controlled genetic background, you would expect the same isoforms to be expressed in the same tissue. But if you are dealing with patient data, most of the time there will be variants that come from genetic diversity, and that is where the number of samples, or the depth of sequencing, needs to be improved. Dealing with human data is pretty hard, because the patients are not genetically homogeneous. For this dataset specifically, it was very hard to find one that had actually paired the patients well enough to do this. Let's move on, because we have more plots in the queue. Like I said, the package is still under development, but we at least try to make all the tables and figures in a format that lets you just continue exploring the data after the analysis. Here I'm going to show you another plot, because in the previous one we see the isoforms themselves but not the difference between conditions: we are plotting the fold change, not the TPMs. One thing that would be interesting is to actually show the difference in counts, or in TPM, between the conditions, and that's what Lucio is going to show now. Exactly. And now here we are comparing the
differential expression directly; this is the fold change, not the expression in each individual sample. For example, another way of exploring the data: say I found that for EGFR there is a transcript that is largely differentially expressed. Is that because it is a low-expression transcript, is it genuinely low expression, or what could be contributing to that specific isoform being differentially expressed? One way is to actually explore the abundance of the transcripts. Just a second, before you do this plot: there's a good question from Wes. The negative binomial model assumes raw counts coming in; do you think that at the transcript level, where different transcripts are more transient or sensitive relative to each other, technical batch effects might get louder compared to gene-level experiments? Yes, and that's something we notice a lot: there is a lot of experimental batch effect between experiments. For patient data you need to be very careful in selecting the dataset, especially because of that. If you choose to do a transcript-level analysis, the hands of the person doing the experiment, and the things that are a problem but not that big of a problem at the gene level, are going to show up more. Oh, that's a pretty plot. Yeah, this is basically taking that same information for the EGFR gene and trying to distill it in a more approachable way. I'm not sure the colors are matched here; I think they are not. Here we try to plot the gene expression. We still have to think about which metric we are using; here I'm just plotting the normalized mean, using TPM, comparing the treatment against the control. There are some genes that are very visual for this. For example, here we see one transcript that has a higher expression in the treatment but actually wasn't found in the control, so you can't
really say if it was increased or not; sometimes it's just a low-expression transcript that couldn't really be detected. But the biggest thing here is that you actually see a trend, an overall increase in the expression of the gene, that somehow didn't pass the statistical test because of the distribution. If you look here, I'm plotting the standard deviation around the mean of the TPM, with the signs up and down, and the translucent ones are the transcripts that didn't pass the differential expression cutoff. You see that the spread in the treatment is bigger than in the control, so could it be that it was detected as differential just because of the variance in the treatment? We're talking about preeclampsia in this case; it's a disease, not a fixed condition. Try NDRG1. N-D-R-G-1. So yeah, we're doing it live. Oh, this is a cool one, because the retained intron isoforms are the ones that are very highly different; those are the blue ones, and they have a higher expression in general. The ones that are close to zero are harder to infer anything from, because they are just low abundance; you can't necessarily say whether you're failing to detect them for technical reasons, because you couldn't go deep enough. But if you look at the highly expressed ones, you see that there is a lot going on: you see processed transcripts and retained introns, and actually some protein-coding transcripts were differentially expressed, while at the gene level, using the traditional cutoffs of 0.05 on the p-value and on the fold change, it would just not be identified, and we would have missed it. To bring in your explanation: NDRG1 is a downstream target of the MYC pathway; it's related to stress response and especially to proliferation. One thing that is strongly related to the preeclampsia pathophysiology is that the placenta
doesn't proliferate as much as it needs to; placentation is very poor during preeclampsia. And this stress signal is upregulated, but it's not the coding isoform that is highly expressed, it's the non-coding one. So even if this gene were central to placentation, to the differential expression in the placenta, it's not producing the canonical protein and it's not acting in the canonical way. This is actually one of the plots we use the most to explain that concept, because when we first explain the idea of the gene not being differentially expressed while some transcripts are, it's hard to imagine what kind of data we are talking about, and here you see a really clear trend. I'm not sure of the specific numbers, but you see that the variance in the assignment of reads to each transcript is really good. You could always explore further, actually doing a mapping to see whether, in the genomic context, you have different coverage on intron sites or exon boundaries. This is something the quasi-mapping approaches do really well: they distinguish transcripts without any knowledge that those transcripts come from the same region. They just ask whether a read matches better to one sequence or the other, without really knowing that the two share an exon. And if you use the bootstrapping methods, like swish does with the Gibbs samples from the salmon implementation, they are really good at capturing the uncertainty around how a read was assigned to a specific transcript from a shared region, because in some cases you are definitely not going to be able to differentiate them, and that's normal; that's part of the biology. If you want to try another one, just so we do two examples:
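The two ideas behind these plots, summarizing mean and spread of TPM per condition and using inferential replicates to gauge assignment uncertainty, can be sketched in a few lines of base R. Everything here is illustrative: object names, column names, and toy numbers are assumptions for the sketch, not the package's actual code, and the uncertainty summary is a simplified stand-in for what swish actually models:

```r
# Per-condition mean and standard deviation of TPM, the numbers behind
# the profile plot described above. `tpm`: transcripts x samples matrix;
# `condition`: one label per sample. (Illustrative, not the real API.)
summarize_tpm <- function(tpm, condition) {
  do.call(rbind, lapply(unique(condition), function(g) {
    sub <- tpm[, condition == g, drop = FALSE]
    data.frame(transcript = rownames(tpm), condition = g,
               mean_tpm = rowMeans(sub),      # point height in the plot
               sd_tpm = apply(sub, 1, sd),    # spread around the mean
               row.names = NULL)
  }))
}

# Toy view of inferential replicates (Gibbs/bootstrap draws): a transcript
# whose reads are ambiguous between isoforms of the same locus shows high
# variance across draws relative to its mean.
replicate_uncertainty <- function(reps) {
  m <- rowMeans(reps)
  v <- apply(reps, 1, var)
  data.frame(mean = m, var = v, rel_var = v / pmax(m, 1))
}
```

For a transcript whose reads could equally come from two isoforms of the same locus, the draws swing widely across replicates, flagging a fold change that should be interpreted with caution.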
try PAM. Let me generate another one. PAM. What is this gene? This is a cool one: you see that it is not differentially expressed at the gene level, but the mean fold change for PAM is very high, and there are a protein-coding and a retained intron isoform that are differentially expressed. The nice thing about this gene is that it's not very explored in the literature; there are a lot of results on it in cancer, so I'm still thinking about its role in the physiology of preeclampsia. But you see, I wanted this example. Oh, you tried your favorite genes; that's good, that's how we started. If you have a favorite gene and you want us to try it, just put it in the chat. You're studying stroma, or cancer, matrix remodeling? Yeah, matrix remodeling; cancer is always there. If you have a favorite gene and want to try, just go ahead. This is a preeclampsia dataset, but like I said, it's easy to adapt, and it's going to have a lot of processes related to cancer. Oh yeah, exactly: preeclampsia is a disease that shares a lot of those processes. As they say, placentation is like a controlled cancer. KRAS? Try KRAS, then: K-R-A-S. You work with immunological datasets? We just have one transcript here. What I learned working with immunologists is that they like to name different transcripts as different genes, so most of their genes have only one transcript. The receptor, IL2RA? I think it's not in our dataset, but you can try. Actually, this data is filtered, right? Yeah, I think we filtered the data so that a feature needs to be differentially expressed at at least one level, so this means it's not differentially expressed at any of the levels. But we actually need to keep the table without the differential expression filter to generate the graph, even if none of them pass. Yeah, like Joy asked:
one thing that we are doing, and it's already ready for the profile plot, is letting you plot the entire list and save it to a folder. I haven't really shown the data, but like I said, we always try to keep the actual data that was used to plot, so we have that information in the table here. You could filter it: if you want, filter gene name equals KRAS, and you would see, not the plot, but where those values come from. So we see that there is actually an increase in the mean in the treatment, but the standard deviation increases as well, so that's probably why it's not appearing as significant at either the transcript or the gene level. Oh, this is just one transcript here? I'll have to check why it's not showing the transcript; it's probably because no transcript was differentially expressed. But yeah, that's pretty much it. Anyway, now we go to the part that people like the most and that bioinformaticians dislike the most; at least, I don't know if you all share my opinion, but I really dislike functional enrichment. One common analysis done on gene expression data is what we call functional analysis. Functional analysis can actually mean totally different approaches, but most of the time you will see it done on top of Gene Ontology, "GO analysis" as people like to call it. Those approaches are divided in two main families. One is over-representation analysis, where you take a group of genes that are differentially expressed and try to get a functional annotation from them. The other, less biased, approach is gene set enrichment analysis, less biased in the sense that you don't decide which genes are differentially expressed before the analysis. That's basically why we need the full table: we are going to use it as background. fgsea is one that uses the background very
well. Here we are using the fgsea package, which is a re-implementation of the GSEA method, gene set enrichment analysis, which was developed at the Broad Institute in the mid-2000s. The fgsea implementation is really good and really fast; they have everything written in C++ in the back end. The kind of data we need for that analysis is an annotation, for example pathways: you download the Reactome annotation and put it in this format, the GMT, and you can download this format directly from the MSigDB website. Exactly, and depending on the gene sets you choose, it's going to give drastically different results. A really commonly used annotation comes from the Molecular Signatures Database, also hosted by the Broad Institute. They have whole collections for areas like cancer; you always see people talking about the hallmark gene sets or the oncogenic signatures, and they have annotations for genes in those categories. A lot of people use the C2 collection, the curated gene sets. They are basically gene sets that come from already published studies, so most of the time they carry the name of the study that published them. C2 is the one that englobes everything; it's a very big one, but then you can go to the smaller subsets. Exactly, and it tries to integrate different databases, like BioCarta and Reactome, and these can basically be interpreted as biological pathways. When people say functional annotation, when someone in the lab asks you to "do the GO", that's what we're talking about. We included in the package this annotation already: the Reactome subset of the C2 Molecular Signatures Database collection. Basically we took C2 and excluded the pathways that superpose one another, so we made a filtered C2 dataset, pretty much. And what is this?
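As a concrete sketch of the two inputs fgsea needs, gene sets from a GMT file plus a named, ranked statistic vector, here is some hedged base R. The file path and the table's column names are hypothetical, invented for illustration; only the `fgsea::fgsea()` call in the comment reflects the real package interface:

```r
# A GMT file is tab-separated: set name, a description field, then the
# member gene (or transcript) IDs. Parse it into a named list, one
# element per set, each element a character vector of IDs.
read_gmt <- function(path) {
  fields <- strsplit(readLines(path), "\t", fixed = TRUE)
  sets <- lapply(fields, function(x) x[-(1:2)])  # drop name + description
  names(sets) <- vapply(fields, `[[`, character(1), 1)
  sets
}

# Build fgsea's other input: a named fold-change vector per transcript
# category, sorted in decreasing order. `res` is assumed to have columns
# transcript_id, log2FC, and category (productive/unproductive).
rank_by_lfc <- function(res) {
  lapply(split(res, res$category), function(d) {
    sort(setNames(d$log2FC, d$transcript_id), decreasing = TRUE)
  })
}

# ranks <- rank_by_lfc(results)                  # hypothetical results table
# fgsea::fgsea(pathways = read_gmt("sets.gmt"),  # hypothetical file path
#              stats    = ranks[["productive"]])
```

Running the enrichment once per category is what yields the separate productive and unproductive pathway tables shown next.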
It's loading, and it's a list with all the pathways and the genes per pathway. Just to show the first one: here we are giving one saved in the GMT format and showing how to transform that into an R list. Basically, this object is a list where the name of each element is the name of the pathway, and each element is a character vector of the names of the genes that are part of that pathway. And what fgsea uses as input is actually the fold changes of each gene in our dataset. So what we did, basically, was transform that GMT into a transcript-level GMT and separate the coding from the non-coding, the productive from the unproductive. And here we created a function that is a wrapper around fgsea to run the enrichment. This can take some time. What is it doing? It is separating the table between productive and unproductive transcripts and creating, for each group, a vector of those transcripts ranked by their fold change. So you end up with two enrichments, or more than two if you choose to split by the different groups: one enrichment for transcript type X and one for transcript type Y. Exactly. So what that wrapper gives you: it takes your table of differential expression, at least for the genes, you can define a p-value cutoff, and it returns the values that are common for that kind of analysis: the NES, which is the normalized enrichment score, plus the p-value and the FDR-adjusted p-value, which can be used to find which pathways are more relevant. What the function is also doing is separating the enrichment for protein-coding genes from the enrichment for each category of transcript that is annotated; like I said, it all depends on the annotation you have. We tried, let me find it, we try to separate that
enrichment table. It's all in one table, but in that table we have the specific enrichment for protein-coding genes, and we created this category that we call unproductive, which is actually the summary of all the non-protein-coding ones. But you can also specify the different kinds, so you can actually have a functional enrichment of the pathways that are affected just by retained introns, or just by transcripts that are known to go through nonsense-mediated decay. It all depends on the annotation you use, and also on the functional annotation, but it can help drive some biology. We have some examples here. For example, let me increase that, because the figure is small. This part is the one that's going to differ the most between users, because everyone has their own set of pathways they like, and that kind of stuff. Here we actually haven't done a great job, because we didn't plug in the names of the pathways; but if you choose another GMT, it works. Exactly, this is just the name that we kept in our GMT file. When you go over the gene set list, it is exactly the names of the elements of the list, so you can have a conversion table of names and just change them to the actual names of the pathways. Those are the Reactome pathways, and they all have a number; that very first one is something related to nitric oxide, if I remember right. The interpretation of that plot is basically that all those pathways are being positively affected by the protein-coding transcripts. Yeah, one thing about this dataset, and it's common in preeclampsia and cancer datasets, is that we see a lot more things upregulated than downregulated, so all the pathways that appear there have a positive NES. So basically
what Lucio is saying is that the interpretation is that those pathways are driven by the productive transcripts in some cases and by the unproductive ones in other cases. Yeah, that's it. And here is another example of the same. Like I said, we are using just the plot to explore, but that table actually has much more information about which genes are driving the result. GSEA has the concept of leading-edge genes, the genes that are actually responsible for driving the enrichment score, and you can dig into that too; in our case, the transcripts that are causing the enrichment. Anything else on the functional side? Yeah, this is actually the most powerful tool. Like I said, here we are not bringing any annotation specific to the condition, we use the generic one, but you can try a more specific annotation for your experiments. And here is an example showing just the separation of the non-coding, what we call the unproductive: the pathways regulated by unproductive transcripts. You see that we have way more retained introns, probably because we have way more retained intron transcripts in the dataset in general, but there are some pathways that are exclusive to, say, processed transcripts or to NMD. And again, suppose you are doing this analysis because you are studying a disease where nitric oxide is known to be affected. If you look just at the gene level, you can find the pathway upregulated at the gene level, but actually there is some biology affecting that same pathway on the non-coding side, the unproductive one. One specific case: NOS2 doesn't appear in this dataset at the gene level, but if you look at the transcript level it appears, and it upregulates the nitric oxide pathway. A good example. So yeah, in the end it can help you drive some biology or, like I said, actually find
novel biomarkers. For example, suppose your question is related to finding biomarkers for a specific disease, or biomarkers for specific phenotypes. This could help you see that it's not just the gene, and not just the protein, that could be a biomarker; the non-coding transcripts, the unproductive isoforms, can actually help you understand the biology and find new markers for the phenotype you study. This is a part of the workflow that is still very much in development, but we try to add some functionality to at least understand and visualize the contribution of the genomic context. One problem that we had is: alright, this transcript is different from that transcript, but what exactly is the difference? Then you would need to go all the way back to Ensembl to see the difference, download the Ensembl PDF, and start messing with the images. So the idea was to plot the genomic context for the different isoforms in a way that lets you see the differences between the introns and exons of each isoform. It looks better in the hand-drawn version, but, for example, let me see if this example is better: this is a long non-coding RNA, which is why everything here is the same color. We brought this example because it really shows the difference. Those are plots of the exons, and of what would be the introns, of each isoform, and you see that some of the isoforms don't even share exons. They could actually be interpreted as totally separate genes; they probably share very little biology from one to the other. This is just a way of visualizing that information. And let me see, I added a protein-coding example, just to help visualize it. Right now, for that function, we still want to add more information; we want the same information that we have on the other plot to be also represented
here. It's still a work in progress, so it will eventually show which of those isoforms are differentially expressed, and have a metric to rank them by, and not plot the others, something like that. This is a good one, because they are very different as well. Yeah, exactly. For example, the EGFR gene is a gene that is known to be affected in preeclampsia, and it's a known proliferation factor. You see that even if you look just at the protein-coding transcripts, they don't all share the same exons, and the non-coding ones share even less. This is just a way to try to explore that. Joy said that they are very different from one another, and yes, they are very, very different; you can't say that they have the same function at all, because they are so different. Here we are using just the information from the annotation to plot, but like I said, we actually want to include more functionality in that plot. And you bring up a really good discussion: knowing that those genes have different transcripts that are totally different from one another, what is actually the functional element of the genome? Is it the gene, the transcript, the protein? There are a lot of concepts that need to be reworked around the gene concept. The way GENCODE builds the annotation is: if it is transcribed from the same region, on the same strand, it is a transcript of that gene. So even if it's a very small transcript, like the yellow one down there, or a very big one that has a long downstream region (the ones with a region downstream of the gene are what people call DoGs, downstream-of-gene transcripts), they are all considered isoforms of the same gene in the GENCODE annotation, because they are transcribed from the same genomic location. And, for the
computational biologists: we didn't even get into the non-canonical translation pathways. Here we are using, how can I say, common knowledge, what is already known in the literature and in the databases. But you could go even deeper and generate your own annotation: really use a data-driven methodology to find which transcripts are actually detected in your dataset, doing a de novo annotation, or a reference-guided de novo annotation that uses known transcripts and expands on them. There are several methodologies that can enhance that, but they are steps previous to this analysis; deciding which group of transcripts to use downstream is not in the scope of this workflow. The thing is, we have a generic concept that what codes for protein is the thing that starts at a start codon, has exons, and stops at a stop codon, but we know today that other things can code as well. Here we have links to some references for the methods we recommend, or at least the methods we use. And this is the paper the dataset came from, exactly that one. So if you are interested in the biology, it's in the paper. They don't explore the differential expression much, because it is an RNA editing paper; that's why the libraries are so deep. And that's actually a good example to show that if you are a data analyst, a bioinformatician, a data scientist, there is a plethora of data in the public databases, because most of the time that kind of dataset comes from a paper where they already knew what they wanted to show. Most of the time they have one gene that they are studying in the lab, and they do that kind of experiment just to show that gene X is upregulated, but they don't really explore the whole dataset. So any work can be re-explored and re-analyzed to
go deeper into the biology, and we actually do that; that's our job. Me and Lucio are from a group in Brazil, and in Brazil almost nobody sequences; it's expensive to sequence anything at a good level, so our work is basically with public data. So, to finalize, we can show our social networks; follow us on Twitter and on LinkedIn. You can show yours first. Yeah, my name is Lucio Heyros, and I can be found on LinkedIn and also on Twitter; I don't remember my Twitter handle, let's see, it's under the same name. I should prepare a slide with that information, but I can definitely be found around social media. We have a preprint that is under review right now that uses this approach to explore the immune response to SARS-CoV-2, the coronavirus, especially focused on that unproductive splicing concept. And one thing we discovered while doing this preprint is that a lot of viruses directly interfere with the host splicing machinery. COVID is one that does that, but herpesviruses and influenza also do it; they can directly interact with the spliceosome and affect splicing-related genes. The quality of the figures is not the best here, but in that preprint we show that there are specific groups of functional pathways that we see enriched just in the unproductive genes, pathways that can be explained in that sense. It's just a preprint right now, but it actually generated a lot of discussion; we have around 700 comments on Twitter around the paper, so at least it's bringing some attention to the work. Can you show figure four? It's a very good illustration of the mechanism. Four or five, something like that. The last one? No, the next one. This is our proposed mechanism: the
SARS-CoV-2 RNA enters the cell, and it can directly interact with the host splicing machinery. We integrated data not just from RNA-seq but also from immunoprecipitation and proteomics datasets, so we can see that the virus directly interacts with some of the host proteins. And this is the kind of information you can only get if you do a transcript-level analysis; the original authors of this data didn't get to that level. And again, this is analysis of already published data. There is a lot of biology that can be unraveled from any RNA-seq experiment, because the amount of data generated by RNA-seq experiments is really large; one paper is never going to explore all the possibilities of one dataset, and we're not even diving into the single-cell side of things. I think we can stop sharing and discuss again. Yeah, that's true; let me stop my screen share. We are done with the demonstration, so we can discuss and answer questions. If you have any questions or discussions, or if you want us to try other gene names, just go ahead. Bring the discussion: every time you see someone saying that they are going to do differential gene expression to find new potential biomarkers, ask them, why don't you do that at the transcript level? And every time someone says they want to do a qPCR to confirm, you get a little sad. But yeah, that's pretty much it. If you want to see a smaller version, with a more mature package, we are going to be at the Bioconductor conference; it's on the last day of BioC 2023, in Boston, so come talk to us in person; we love to chat at events. Hopefully we at least stirred the thirst for knowing other categories of analysis, for looking at gene expression data with a broader vision, and for not just accepting that gene expression is one simple thing where you just look at differential
expression of genes; the biology around it is definitely much, much broader. Molecular biology is hard, and there are a lot of questions to be answered. So I thank everybody for coming; we started with around 50 people and we still have more than half, which is great. Well, thanks, everybody. If you want to follow along: I have a commitment to finish this paper by the end of the year, so we are going to develop this package a lot over the next months. It's working; it just needs some small adjustments, like the colors in the last plot. It's a project that we are committed to developing further and adding functionality to. It's something we really care about, because we basically use it ourselves, so we are definitely going to keep working on it; the easier it is to use, the easier it is for us as well. And thank everybody for being here. If you have any comments or suggestions, go on GitHub and make suggestions; and if you run into problems, go on GitHub and report them. Right, thank you, guys. Rachel, I think we can finalize here. Yeah, great; if we're all set, we can just go ahead and end the webinar. So thank you both for your time today. Thank you, have a great day. See ya.