 Welcome to the MOOC course on Introduction to Proteogenomics. In the last two lectures, Dr. Kelly Ruggles talked to you about some of the advancements in genomic technologies. In today's lecture, Dr. Ruggles is going to talk about transcriptomic studies, especially how to utilize RNA sequencing with reference to the various types of databases available, and what the challenges of using these technologies are. She will also talk about a standard RNA sequencing workflow, which includes alignment to the genome, counting coverage per gene or transcript, and differential expression analysis. Today's lecture will also cover the various software tools available for gene and isoform quantification and for differential expression analysis. Dr. Ruggles is going to talk about gene fusions, where genes from different chromosomes, or from distant parts of the same chromosome, come together because of chromosomal translocation, inversion or interstitial deletion. The UCSC Genome Browser will be introduced, along with the data that have been collected and mounted on it for human genes so far. Dr. Ruggles is also going to talk about advancements in the field of transcriptomics, where one can now perform RNA sequencing on single cells. So let us welcome Dr. Kelly Ruggles to give today's lecture.
So, the standard RNA-seq workflow. Again, you do your next-gen sequencing, and you have to align it to the genome. The most commonly used aligner for RNA-seq at this point is STAR, so I would very much recommend you use that aligner if you are doing RNA-seq analysis. Then you count coverage per gene or transcript, and then you do differential expression. Some outputs of this are things like volcano plots, if you are familiar with those. Let us say you have disease and healthy: you can look at the fold change of different transcripts versus the significance level of those transcripts, and you are looking for things that are sort of either here or here in your data, that is, a large fold change in either direction with high significance.
You could do hierarchical clustering of your data, so you can see if your disease versus healthy or different subtypes cluster together based on the expression of different genes.
But there are some challenges to RNA-seq alignment. These include the fact that there are introns. With whole genome sequencing, right, you have chunks of DNA that you can just map back to your reference genome. Here, because you are looking at RNA, you have exons that have been spliced together. So you have places where there will be a gap, and you have to account for that gap in your alignment. The aligners that work with RNA-seq have to be a little more sophisticated, because they have to deal with the fact that there are these gaps, figure out where they are and where the boundaries are, and then record the junctions at these boundaries. So that is something that you have to keep in mind with these aligners. And then if the reads are in introns or in intergenic regions, what does that mean? If we don't expect that to be in RNA, is it real?
I did also want to mention some of the different ways that people count the reads per gene, because there are a lot of different ways to do it. You need a gene model, meaning you need a database that says this is where your exons are, this is where your start sites are. So we have these gene models, we have these databases. There are lots of databases available with differing levels of complexity. For example, RefSeq or Ensembl: these are databases that have files that say this is what we expect to see at the transcript level. And you can use those to measure how many reads you have of different transcripts or genes, based on what we know about how those genes are structured in the genome.
And there are lots of ways of actually reporting how much expression there is at the transcript or gene level. So there's RPKM, which is reads per kilobase per million mapped reads.
There's FPKM, which is fragments per kilobase per million mapped fragments. So these are similar. FPKM is typically used for paired-end reads, versus RPKM for single-end reads.
Back to expression units. As I mentioned, there are several ways that people will express the reads. You have to normalize the reads. You have reads, but certain genes are long, right? So if a gene is long, you're going to have more reads that map to it because it's just longer. That doesn't mean there's more of that gene. That just means it's long, right? So we have to take that into account. We also have to take into account that maybe a certain sample just had more reads in it. That doesn't mean that all of the genes in that sample are up. That just means there are more reads that we ended up getting in that sample. So these are two things that we have to normalize for.
There are really two different ways of doing this. The first one normalizes by the depth first, so how many reads are in the sample, and then normalizes by the length of the gene. That's RPKM and FPKM. The other one normalizes by the length of the gene first, and then by the number of reads in the sample second. That's TPM, transcripts per million. I don't know which one is better. I don't know if anyone knows which one is better. There's a review here on the differences between the two. They have, again, strengths and limitations. So whatever your problem is, just spend some time thinking about this, and know that there are different ways of doing this and that they have different effects on your downstream analysis.
Okay, so in terms of RNA-seq software, I just wanted to point out that there are a lot of different ways of doing gene or isoform quantification. There are several I've listed here, and there's one of these papers that goes into lots and lots of detail about them and more. So you can look into this if you're interested in learning more about how to quantify at the gene or isoform level.
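To make the two normalization orders described above concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the function names `rpkm` and `tpm`, the three genes, and their counts and lengths are all hypothetical and do not come from the lecture, and real quantification tools handle many details (effective length, multi-mapping reads) that are ignored here. For paired-end data the counts would be fragments, giving FPKM instead of RPKM.

```python
def rpkm(counts, lengths_bp):
    """Depth first, then length: reads per kilobase per million mapped reads."""
    per_million = sum(counts) / 1e6          # sequencing depth scaling factor
    return [(c / per_million) / (length / 1e3)
            for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Length first, then depth: transcripts per million."""
    # Reads per kilobase for each gene (length normalization comes first).
    rates = [c / (length / 1e3) for c, length in zip(counts, lengths_bp)]
    per_million = sum(rates) / 1e6           # depth scaling on the length-normalized rates
    return [r / per_million for r in rates]


# Hypothetical sample: three genes with made-up read counts and lengths (bp).
counts = [100, 200, 300]
lengths = [1000, 2000, 6000]

rpkm_values = rpkm(counts, lengths)
tpm_values = tpm(counts, lengths)

for name, r, t in zip(["geneA", "geneB", "geneC"], rpkm_values, tpm_values):
    print(f"{name}\tRPKM={r:.1f}\tTPM={t:.1f}")
```

In this toy sample the second gene has twice the reads of the first but is also twice as long, so both units give the two genes the same value. One practical difference is that TPM values add up to one million within every sample, while the sum of RPKM values varies from sample to sample.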
There are several differential expression analysis packages. DESeq2 is a really commonly used one. There's also edgeR. This last paper actually compares the two and talks about when you should use one versus the other. So I'm leaving these papers here for you guys in case this is something that you are going to do and you want to learn more about it. But if you have specific questions about this, you can just come find me and we can talk about it.
So one of the things that you can get from RNA-seq, and I think for a lot of these packages you have to specifically ask for this, so if you want it, you should make sure in your settings that you're asking for it. You can get junction files, which we're going to talk about, that are in BED file format. Also, a lot of these annotation databases like RefSeq use this BED file to say this is where an exon is, this is where an intron is. So these are the files that tell you the structure of the actual genome.
And so this is just an example BED file. Here I've included what the columns mean. The first column is the chromosome. The second is the start and end of that. Sorry, that actually should be gene. I can correct that and send it out again. So it's gene start and gene end. Then a name, a score, a strand. And then there's this display info, because there are these browsers, which we'll talk about a little bit, where you can change colors, and if you want to put it up on a specific browser you can have it look a certain way. So there are columns for that. Then the number of exons or blocks, the sizes of the blocks or exons, and the starts of the blocks.
And I'm going to go through an example of what this looks like. So here we have one row from a BED file. We have chromosome five. We know that this gene starts at this coordinate. And then we have block size and block start. So these are the exons. So we know that the exon is 126 base pairs long.
And this is where it ends. So the block start is the start of the gene plus zero. And then the block end is the start of the gene plus the block size, so plus 126. And then you have the second block or exon. Again, you start from the start of the gene and you add block start number two, which is 4,509. So this is where the second exon starts. And you know that the second exon is 78 base pairs long, so now you know where block two ends. And then for block three, you have, again, the start of the gene plus this block start, 24,849. You know that it starts here, and from the block size you know where block three ends. So then you know exactly where your exons are based on what the BED file tells you about the coordinates of these exons.
What these junction files from the RNA-seq data will give you is not just reads that cover different exons, but also where the exons connect. So if you have alternative splicing occurring, where you have exon one spliced to exon two, there'll be a read showing that these two are connected, which is called a junction read, and it will be in this BED format. And so here, if you have exon two and exon three, you'd have another junction, junction two, that would connect these, so you'd end up getting this. Or if you had a junction connecting exon one with exon three, you'd get something that looks like this. So these junction files are something that comes out of RNA-seq in addition to the expression analysis data.
And the last thing that can come out of this RNA-seq data is gene fusions. Gene fusions are when one gene, which can be from a different chromosome or from far away on the same chromosome, is actually fused with another gene. So here we have gene X on chromosome one and gene Y on chromosome two, and you can see that these two are connected because of chromosomal rearrangements, which typically occur in cancer. So this is a pretty cancer-specific analysis.
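Going back to the BED example walked through above, the block arithmetic (gene start plus block start, then plus block size) can be sketched in a few lines of Python. This is a rough illustration, not a full BED parser. The block sizes 126 and 78 and the block starts 0, 4509 and 24849 are the numbers from the example; the gene start coordinate, the gene name and the third block size are made-up placeholders, because they are not stated in the lecture. The function names `bed_blocks` and `introns_between` are also hypothetical.

```python
def bed_blocks(bed_line):
    """Return (chrom, start, end) for each block (exon) of one 12-column BED row."""
    fields = bed_line.rstrip("\n").split("\t")
    chrom = fields[0]
    gene_start = int(fields[1])              # BED starts are 0-based
    # Columns 11 and 12 are comma-separated lists, often with a trailing comma.
    sizes = [int(x) for x in fields[10].rstrip(",").split(",")]
    starts = [int(x) for x in fields[11].rstrip(",").split(",")]
    # Block start = gene start + block start offset; block end = that + block size.
    return [(chrom, gene_start + offset, gene_start + offset + size)
            for offset, size in zip(starts, sizes)]


def introns_between(blocks):
    """Return the gaps between consecutive blocks, i.e. the spans a junction read skips."""
    return [(left[0], left[2], right[1])
            for left, right in zip(blocks, blocks[1:])]


# Hypothetical row: gene start 1000000 and third block size 150 are placeholders.
example_row = "\t".join([
    "chr5", "1000000", "1024999", "exampleGene", "0", "+",
    "1000000", "1024999", "0", "3", "126,78,150,", "0,4509,24849,",
])

exons = bed_blocks(example_row)
introns = introns_between(exons)

for chrom, start, end in exons:
    print(chrom, start, end)
```

With the placeholder gene start, the first exon runs from 1000000 to 1000126 and the second from 1004509 to 1004587, which is the same addition described in the example. A junction between exon one and exon two corresponds to the first gap returned by `introns_between`.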
This is just a schematic where each of these is a different breast tumor, and each of these lines connects regions based on how the genome has been rearranged. So you can see some of them have a lot of rearrangements and some of them don't, so you'll get fusion genes in certain samples and not others. But it's another thing to keep in mind when you have this RNA-seq data. This is another thing that you can look at as well.
And so there are two browsers I'm going to talk about that you can actually use. You can take your data and upload it, or you can look at the gene annotation for a specific gene. So there's the UCSC Genome Browser. Has anyone used this? It's pretty common and useful. So if this is something you think you're going to be doing, I would spend some time exploring it. Here you can look at a specific gene or part of the chromosome. So these are exons here, and introns. There are so many tracks. You could put up hundreds of tracks. There's so much data on here. You could spend days looking at it, exploring maybe your favorite gene, seeing what's available. They have lots of different publicly available data sets that have already been mounted, so you can just click on them. They have epigenetics data up there from ENCODE. There are all sorts of things that you can look at.
And in addition to that, there's the Integrative Genomics Viewer, IGV. This is a GUI from the Broad that you actually download to your desktop, and then you can upload your own data. They also have different genomes and annotations already available and mounted in it. So for example, the human hg19 genome is up there, and you can just look at the annotation within that. And so here I've uploaded a bunch of different data.
And then I did want to touch on single-cell RNA-seq, because this is the hot field right now. People are really excited about it.
And so with this, you can actually measure the RNA expression in a single cell, versus what we normally do, which is we just take a chunk of something and look at the expression across many different cells. But there's heterogeneity, especially in cancer. You may be measuring normal tissue, you may be measuring a certain clone of one cancer. So you're kind of diluting your results. But with single-cell RNA-seq, you don't get the same coverage, right? You only get about 1,000 genes that you're measuring, versus with bulk RNA-seq where you're getting almost all, if not all, of the genes. But it's still very cool.
There are a couple of ways of doing this, so I'm just going to talk about one way, which is droplet barcoding. You have a cell and you encapsulate it into a droplet, okay? So you have all your cells and you put each one in a droplet. And within that droplet, there's lysis buffer and everything you need, essentially, to do your library prep within the droplet. So you lyse the cell, you release your RNA, and you can then barcode it and do cDNA synthesis and get it all ready for sequencing, essentially. And you barcode it so you know that everything in this droplet comes from one cell. And then you just break the droplet and you do the same kind of multiplexing you would do with lots of samples; you just do it with lots of cells. So then you measure all of your RNA, and later, after you've done your RNA-seq, you can pull out each of the different cells based on the barcode. So that's currently one of the ways that people are doing single-cell RNA-seq and getting cell-based data.
The one thing that I haven't seen, and I know that some people are doing it, but I think it's a lot harder, is doing SNP calls from single-cell data, because there's just not enough coverage.
So that's something that's not currently happening, but I'm sure eventually we'll figure out how to do it and somebody will do it and it'll be exciting.
Yeah, say that again, sorry. It's similar to barcoding, like the multiplexed samples, right? It's like adapter barcoding, yeah.
Yeah, yes, of course. I mean, there's always error that we have to deal with in experiments. So how do we account for the PCR amplification error? What happens is, when you make those clusters, when you're making your libraries, you're making many, many different copies. So if only one of them has a certain SNP, you're going to say that that was because of a PCR error. The genome aligners kind of have all of that built in.
I don't know the answer to that. The question was, what if the error is early enough that it's in every copy? I don't know. That's a good question.
Yeah, that's a good question. So the question is, with Illumina there was a size filter, essentially, right? Like, you knew that your fragments were a certain size. I don't know about that here, but I do think that that's incorporated into this. So you have kind of a similar size to what you would have in your bulk RNA-seq. Everything that you would do at the bulk level is kind of included in the droplet. So it's a very similar sequencing process. It's just that there's less RNA, so your coverage is lower. Coverage is much lower with single-cell RNA-seq, because with bulk you have lots of copies. With single-cell, you only have a certain number of copies, so you only measure up to about 1,000 genes, versus 20,000 genes with bulk.
So in conclusion, you have seen how studying the transcriptome can be very useful in providing the first level of functional information obtained from the genes.
If you think about the central dogma, RNA is formed from DNA in the process of transcription, and then proteins are formed from RNA in the process of translation, so the first set of functional information comes in the form of transcripts. In today's lecture, you have seen how introns become a problem when aligning RNA sequencing data to the reference genome sequence. You also learned how to read and understand the BED file format, which is used for data analysis and representation in sequence alignment. I hope you got the concepts of RNA sequencing and how droplet barcoding can be done for single-cell RNA sequencing, along with the pros and cons of this technique. In the next lecture, Dr. Ruggles will continue the discussion of genomic and epigenomic technologies, with more focus on epigenomic analysis. So let us continue this discussion about genomic revolutions in the next lecture. Thank you.