Okay, so let's get going. My name is Joel Rozowsky and I'm with Mark Gerstein at Yale. I'd like to thank the organizers for inviting me, and also all of you who are still here late in the day, and it's going to be much later. Enjoy.

This session, in which I'm going to be talking, followed by Tom Gingeras, is about using ENCODE data for analysis, in particular the analysis of non-coding RNAs. What I thought I'd present is some work that we did about a year and a bit ago on the comparative analysis of non-coding transcription, using ENCODE and modENCODE to do a comparative transcriptome analysis and see what we can learn about human transcription using ENCODE data.

As a brief outline, my presentation is going to focus on three components. One is a little bit of background, which actually goes back to the beginning of the ENCODE project. For those of you who were part of ENCODE at the beginning, you'll remember that this was a bit of a talking point: the discovery of pervasive transcription. I'll then follow up with a more recent look at pervasive transcription using ENCODE and modENCODE data, with the advent of next-gen sequencing, and then drill down in a little more detail into pervasive transcription in the context of transcribed pseudogenes.

Before I talk about non-coding transcription, let's put in context what we know about protein-coding annotation. If you look over the last decade or so, the number of protein-coding genes has actually become very stable, and when I talk about protein-coding genes I'm thinking about protein-coding gene loci, the number of genes, not the number of isoforms. What I'm plotting here is the different annotation sets of the last decade or so for human, worm and fly, and basically you see the number of protein-coding genes
has stabilized. If you go back a decade before this, these counts were actually wildly fluctuating, but now they are basically stable. So the number of isoforms per locus has increased, but the total number of loci has basically become stable.

Now of course what I'm going to focus on is non-coding transcription and non-coding annotation. Just over a decade ago, with the advent of custom tiling arrays, people were able to assay the amount of transcription in the human genome in an unbiased fashion. When they did that, they found that there was a significant amount of transcription occurring outside protein-coding genes, and this was termed pervasive transcription. The first publication that documented this was from Tom Gingeras's group, where Kapranov et al., way back in 2002, found that there was an order of magnitude more transcription than could be accounted for purely by annotated exons in the genome. So if annotated exons compose one or two percent of the genome, somewhere between 10 and 20 percent of the genome was being transcribed. Then there were follow-up publications from Mike Snyder's group, and these were repeated by both Gingeras and Snyder in the context of the whole genome. Then the pilot ENCODE paper was published in Nature in 2007, and this focused on 1% of the genome. The conclusions of that were that about 15% of the bases represented in the unbiased tiling arrays were transcribed in at least one tissue.
So this confirmed that there was a significant amount of transcription going on in the human genome that is not simply accounted for by protein-coding genes.

You can visualize that with these signal tracks. This is from using, as I said, these custom tiling arrays that would tile a chromosome in an unbiased fashion; of course, you would have to skip repetitive elements. When you look at the signal map that you obtain from a tiling array, you see big blips corresponding to the exons, because exons are transcribed. But you'd also find lots of transcription in regions like here. This could potentially correspond to either a retained intron or novel exons or something else, and things like that would account for the additional 10% of the genome. Just to add: because this was somewhat controversial at the time, those numbers of 10% were actually very conservative estimates of the amount of novel transcription.

Now, the way we analyzed that sort of data at the time, because as I said it was tiling array data, was with algorithms such as minrun/maxgap in order to identify regions that were transcribed. At the time we called these regions TARs, transcriptionally active regions, or alternatively transfrags. So if you looked at the signal map from the tiling arrays, you could segment the signal into regions that are transcribed. Some of these TARs would correspond exactly to known exons, some would overlap known exons but potentially have boundaries that don't exactly agree, and then some would correspond to novel regions of transcription that weren't previously annotated.

When we published these results way back when, they were somewhat controversial.
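As an illustration of the min-run/max-gap idea mentioned above, here is a minimal Python sketch. It is not the implementation used for the tiling-array analyses: the function name `segment_tars`, the parameter names, and the use of probe indices instead of genomic coordinates are all assumptions made for the example.

```python
def segment_tars(signal, threshold, min_run, max_gap):
    """Segment a per-probe signal into transcribed regions (illustrative only).

    A probe is "positive" when its signal is >= threshold. Positive probes
    separated by at most `max_gap` non-positive probes are joined into one
    region, and regions spanning fewer than `min_run` probes are discarded.
    Returns a list of (start, end) probe indices, with `end` exclusive.
    """
    regions = []
    start = None      # index of the first positive probe in the open region
    last_pos = None   # index of the most recent positive probe
    for i, value in enumerate(signal):
        if value < threshold:
            continue
        if start is None:
            start = i
        elif i - last_pos - 1 > max_gap:
            # The gap is too long: close the open region and start a new one.
            if last_pos - start + 1 >= min_run:
                regions.append((start, last_pos + 1))
            start = i
        last_pos = i
    # Close the final region, if any.
    if start is not None and last_pos - start + 1 >= min_run:
        regions.append((start, last_pos + 1))
    return regions


# Hypothetical signal values, for illustration only.
example_signal = [0, 5, 6, 0, 7, 0, 0, 0, 8, 0, 9, 9, 9]
print(segment_tars(example_signal, threshold=4, min_run=3, max_gap=1))
```

Regions produced this way could then be compared with annotated exons to separate exact matches, partial overlaps and novel regions, as described in the talk.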
Even though we've heard a lot earlier today about regulation in terms of transcription, a lot of people didn't really believe that a great portion of the genome is transcribed. This is from a PLoS Biology piece; they concluded that the majority of the pervasive transcription, or novel transcription, was due to technical artifacts. One thing I want to emphasize is that the analysis I'm showing you, which was using the tiling arrays, was all done using replicate analysis. So these were results that were biologically reproducible if you repeated the assay; they were not just due to noise.

Now, one of the criticisms of the results about pervasive transcription was that it was all done using tiling arrays. Tiling arrays were great in terms of being able to tile the genome, but they had problems, and that's part of the reason we don't really use tiling arrays for these sorts of assays anymore. One of the main reasons is that tiling arrays had issues of cross-hybridization. A lot of the oligo probes that tiled a region would have perfect matches to the exact reverse complement of the thing you were targeting, but you'd also have specific cross-hybridization, and then you'd have more non-specific cross-hybridization. This was an issue which made some people quite skeptical about those results.

But of course, with the advent of next-generation sequencing, we could do the equivalent of a transcription array-based assay using RNA-seq. Even though the example I'm showing at the bottom is for ChIP, the idea for the RNA equivalent is the same: you had an array-based signal, which became much cleaner when you used the sequencing equivalent, RNA-seq. That's not to say RNA-seq doesn't
have its issues in terms of mapping and things like that, but it was a much cleaner assay. And just for amusement: if you look over the last couple of decades, this is NIH funding for grants that involved the term microarray versus sequencing. You can see sequencing has basically blown up, while microarray plateaued in the mid to late 2000s and is declining. As we all know, there are a lot more people doing sequencing-based assays than arrays.

So, to address this issue of pervasive transcription (this wasn't the main goal of the project, but it was one of the things we could do using this data): NHGRI funded the modENCODE project to parallel the ENCODE project. Part of the reason for doing modENCODE was to do the equivalent types of assays in the model organisms worm and fly, and to be able to compare against human to see what we can learn in human by doing this comparison across these three very distant organisms, these three metazoans.

In 2014 we published this paper, which summarized the results of a huge amount of data. The dark blue is the RNA-seq data, the light blue is the ChIP-seq, and the green is the chromatin modification, or histone modification, data. Even compared to the modENCODE publications from 2010 and the human ENCODE paper from 2012, the 2014 papers had significant deltas in terms of additional data. In total there are about 3,000 data sets comprising over a hundred billion reads. We wanted to use this resource to address the question of pervasive transcription.

To do this analysis and look at the amount of transcription we detected, one has to be very careful to set the data sets from different organisms on an equal footing. So we set up a way to uniformly process all the RNA-seq data in the
compendium across all three organisms. And we used this method to set the threshold across all three organisms, because all three organisms were sequenced with different numbers of tissues, to different sequencing depths, and the genomes obviously have different sizes. I'm not going to go into too much technical detail, but if you look at the exon discovery rate as a function of the novel TAR discovery rate, you can set a threshold at 95 percent, which ends up actually being a fairly conservative threshold for detecting transcription. That way you can set the threshold for detecting transcription uniformly across all three organisms in a consistent fashion.

So we did that, and one other thing we did, which I'll summarize in the next couple of slides, is to use all this transcription data, in addition to all the other ENCODE data that we had available, to try to predict novel non-coding RNAs. We had a paper that we published about five years ago where we used a machine learning approach to predict novel non-coding RNAs using known non-coding RNAs and a variety of different features. We chopped the genome up into windows, calculated each of these features for those windows, and then used a machine learning method; it actually ended up being a random forest approach. You can see that even though secondary structure ends up being a very strong feature by itself, it's not the sole feature: by integrating multiple features you get improved power to predict novel non-coding RNAs. So we took this method and applied it to the data.

In summary, before we looked at novel non-coding annotation, we could look at the annotated non-coding elements that we had in all three organisms. At the top we've got exons of protein-coding genes, and we can see that in the human genome exons, and this includes UTRs, comprise about three percent of
the genome. In worm it's about 34 percent; in fly it's about 28 percent. Then we could look at the equivalent numbers for pseudogenes: pseudogenes are about 1% of the human and worm genomes; in fly it's significantly less. And then there's a variety of annotated non-coding RNAs, as of the best annotation available in 2014. This includes microRNA precursors, tRNAs, snRNAs, snoRNAs, lincRNAs and piRNAs. When you add up all the non-coding RNA annotation, what you find is that between about 0.6 percent and about 2.5 percent of each of the human, worm and fly genomes is transcribed in terms of annotated non-coding RNAs. So if you add up protein-coding genes, pseudogenes and all annotated non-coding RNAs, this is basically the sum total of everything that is annotated to be transcribed in each of these three genomes.

We wanted to look beyond anything that's annotated: what is the novel non-coding transcription in these three genomes? When we do this using the thresholds I showed, which we picked uniformly across the three genomes, what we found was that about 30 percent, about a third, of each of these three genomes is transcribed outside annotated regions, that is, outside exons of mRNAs, pseudogenes and annotated non-coding RNAs. Now, that phenomenon of pervasive transcription was initially reported mostly in the context of the human genome, so a lot of people thought this was specifically a phenomenon of the human genome. But what we're showing is that if you analyze the data in an equivalent fashion, it basically doesn't matter which genome you're in: across these three very distant organisms you find the same result at a pretty stringent threshold.
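To make the "fraction of the genome transcribed outside annotation" calculation concrete, here is a small Python sketch using interval arithmetic. It is a simplified illustration, not the pipeline from the 2014 analysis: the function names, the single-chromosome half-open coordinates and the toy numbers are assumptions made for the example.

```python
def merge(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def subtract(regions, mask):
    """Return the parts of `regions` that are not covered by `mask`."""
    mask = merge(mask)
    result = []
    for start, end in merge(regions):
        cursor = start  # left edge of the part not yet accounted for
        for m_start, m_end in mask:
            if m_end <= cursor:
                continue            # mask interval lies entirely to the left
            if m_start >= end:
                break               # mask interval lies entirely to the right
            if m_start > cursor:
                result.append((cursor, m_start))
            cursor = max(cursor, m_end)
            if cursor >= end:
                break
        if cursor < end:
            result.append((cursor, end))
    return result


def novel_fraction(tars, annotated, genome_length):
    """Fraction of the genome covered by TARs outside annotated regions."""
    novel = subtract(tars, annotated)
    return sum(end - start for start, end in novel) / genome_length


# Toy coordinates, for illustration only (not real data).
tars = [(0, 100), (150, 200)]
annotated = [(50, 120), (160, 170)]
print(novel_fraction(tars, annotated, genome_length=1000))
```

In practice the annotated set would combine exons of mRNAs, pseudogenes and annotated non-coding RNAs, and the calculation would be repeated per chromosome and per organism.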
So this is conservative and reproducible, in that these are results that are repeated when you do the experiments in replicate. You find about a third of the genome is being transcribed. Now that doesn't mean that this third of the genome is necessarily biologically functional in terms of making transcripts that do something specific, but a certain fraction of them might be. When we use the supervised machine learning method that I mentioned two slides back to predict novel non-coding RNAs, you find you can predict about another percent of the genome. So this is small: the number of novel non-coding RNAs of the types that we already have annotated is not going to add that much more. But there's still a large portion of this unannotated transcription going on where we don't really know what its cause is.

One of the obvious questions people ask is: what is the expression profile of these novel TARs? If you compare against protein-coding exons (this is frequency versus expression on a log scale), you see that protein-coding exons tend to be more highly transcribed than these novel ones. However, among the novel ones there is a smaller number that are very highly expressed. You could also look at where this novel transcription occurs in the genome relative to other things.
One thing we looked at is how they relate to the positions of enhancers or distal HOT regions. Distal HOT regions are basically high-occupancy regions, which are probably another type of regulatory element, maybe enhancers too. What you find is that a significant fraction of these occur within these regulatory elements. You can actually do Fisher's exact test and compare to where things would occur by chance, and you find a significant enrichment: this transcription is not just randomly occurring in the genome, even though it's a third of the genome. It's occurring at these regulatory sites. So I think this gives you further evidence that even if this novel transcription is not necessarily making RNAs whose function we know, it's indicative of regions of the genome that are biologically active or biologically important. So maybe that's because it's an enhancer and this is some eRNA nearby, or something else.
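The enrichment test mentioned above can be sketched as a one-sided Fisher's exact test on a 2x2 table of genomic bins. This is a minimal standard-library version written for illustration; the function name, the table layout and the example counts are assumptions, and the talk does not say how the bins or the background were actually defined.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test p-value for enrichment.

    2x2 table (counts of genomic bins):

                          in regulatory region   outside
        novel TAR bins            a                  b
        other bins                c                  d

    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1 = a + b          # number of novel TAR bins
    col1 = a + c          # number of bins inside regulatory regions
    total = a + b + c + d
    upper = min(row1, col1)
    numerator = sum(
        comb(col1, k) * comb(total - col1, row1 - k)
        for k in range(a, upper + 1)
    )
    return numerator / comb(total, row1)


# Made-up counts, for illustration only.
p_value = fisher_exact_greater(a=40, b=60, c=100, d=800)
print(f"one-sided p-value: {p_value:.3g}")
```

A small p-value here would indicate that novel TAR bins fall inside regulatory regions more often than expected by chance, which is the kind of enrichment described in the talk.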
There's just a lot going on in these regions.

Just to continue this, we wanted to see whether these novel transcribed regions correlate with nearby annotated exons or non-coding RNAs. What we found is that you can find novel transcribed regions that have strong correlations with protein-coding genes or non-coding genes in each of the three organisms: you find an orthologous protein-coding gene or non-coding RNA in each of the three organisms and a corresponding novel non-coding RNA that has a similar correlation. You can find examples where they are correlated and examples where they are anticorrelated. The inference is that, even though you don't have synteny between the three organisms, you can find an orthologous non-coding RNA in terms of its behavior and its relative position to the orthologous gene.

So, just changing focus a little bit.
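As a rough illustration of the co-expression comparison described above, the sketch below computes a Pearson correlation between the expression profile of a novel transcribed region and the profiles of nearby genes across the same samples. The talk does not state which correlation measure was used, so Pearson correlation, the function names and the toy profiles are all assumptions.

```python
from math import isnan, sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2 samples)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # a constant profile has no defined correlation
    return sxy / sqrt(sxx * syy)


def strongest_partner(tar_profile, gene_profiles):
    """Return (gene name, r) for the gene with the largest |r|.

    Both strong positive and strong negative correlations are kept, since
    correlated and anticorrelated pairs are both of interest.
    """
    best_name, best_r = None, 0.0
    for name, profile in gene_profiles.items():
        r = pearson(tar_profile, profile)
        if not isnan(r) and abs(r) > abs(best_r):
            best_name, best_r = name, r
    return best_name, best_r


# Toy expression values across four samples, for illustration only.
tar = [1.0, 2.0, 3.0, 4.0]
genes = {"geneA": [8.0, 6.0, 4.0, 2.0], "geneB": [5.0, 1.0, 4.0, 2.0]}
print(strongest_partner(tar, genes))
```

The same kind of comparison applies to a pseudogene and its parent gene: a profile that is uncorrelated with the parent is hard to explain by reads from the parent simply mismapping.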
I come from Mark Gerstein's lab, and Mark's lab has focused for a long time on pseudogenes. In his mind, and I probably agree with him, pseudogenes are among the most interesting intergenic elements in the genome. The formal properties of pseudogenes are that they are inheritable and, by definition, homologous to a functional element. The default assumption a lot of people have is that pseudogenes have no function.

Just as a reminder for those who may not be quite familiar: there are two main mechanisms that create pseudogenes, either genome duplication, or a transcribed RNA that is retrotransposed and inserted back into the genome. It then takes a variety of different mutations, such as premature stops, frameshifts and so on, to disable the gene. So the pseudogene can't function the same way as the parent gene, because its actual protein-coding potential has been disrupted.

The lab has been involved in annotating pseudogenes in the genome. We've got our own pipeline for doing this, and as part of the GENCODE project we've been collaborating with RetroFinder, which is at Santa Cruz, and HAVANA, which is at Sanger, to combine the annotation output from the three different pipelines in order to get a consensus set of pseudogenes, and this is part of GENCODE. What I'm going to show you is that we've annotated these pseudogenes by combining with 1000 Genomes data, but more importantly ENCODE data, to see what we can say about whether these pseudogenes are really dead.

Now, people have reported pseudogenes being transcribed for a while, but a lot of people were quite skeptical of that. Okay, given the time, I'll go through this quickly.
People have reported that pseudogenes are transcribed, but people were quite skeptical because they would say, well, it's just co-transcription, or false-positive transcription from the parent gene: the parent gene is transcribed and you're just getting sequence reads with mismatches mapping to the pseudogene. However, when you have lots of data, and this example is done in worm, you can see that if you look at the transcription of the pseudogene relative to the parent gene across a variety of different stages, you find it's not correlated. So this can't just be a direct consequence of the parent gene being transcribed and reads mismapping. When you use this approach, you find that, conservatively, there are probably more than a thousand human pseudogenes being actively transcribed. We've done experimental validation using primer-specific RT-PCR to show that these are actually transcribed. This is the one slide
I just wanted to emphasize. What we did, using ENCODE and modENCODE data and not just the transcriptional data, is we took all our pseudogenes in human, worm and fly and layered on all this functional annotation data. So you can differentiate those that are transcribed, those that have active polymerase, those that have active chromatin marks such as H3K27 acetylation, and those that have transcription factor binding. The result is that there's a small number of pseudogenes in each of the three organisms where you see all these things occurring, but there's a whole spectrum in the middle where they're partially active: maybe you don't see polymerase, but you do see transcription and you do see some active chromatin mark. The consequence of this is that there are a lot of pseudogenes that aren't functional in terms of their original protein-coding role but have potentially acquired some new function, and this is one form of pervasive transcription. In the literature there are some known, specific cases of transcribed pseudogenes that have acquired new functions, such as endogenous siRNAs, microRNA decoys and so on.

So, just to summarize.
I'm not going to go through this in detail. I gave you the history of pervasive transcription; our results demonstrating that pervasive transcription does in fact occur and is actually conserved across all three of these organisms, including human; and that transcribed pseudogenes can be well characterized, and there's lots of evidence that many pseudogenes have a lot of biological activity and some interesting potential roles.

And for the last slide, just to emphasize: this was done as a big coordinated project in 2014. It was led largely by Mark Gerstein, with a number of members of Mark's lab, and the data was principally from Brenton, Sue, Tom and Bob. Christina and Baikang led the pseudogene effort, which I reported on at the end. So that's it.

In your talk you mentioned software that predicts novel non-coding RNAs. Is that something that's part of the ENCODE project? Is it released somewhere?

Yeah. I'm not sure if the link is on the current ENCODE site; I know it was on the previous iteration. But you could definitely go to our website, gersteinlab.org. It's all available for download.

Okay. My second question is: does the ENCODE project add any new long non-coding RNAs to the annotation, like the GENCODE annotation or the Ensembl annotation?

Sure. Most of this was focused on ENCODE 2 data, but there's a significant effort as part of ENCODE 3 to identify novel long non-coding RNAs and other types of non-coding RNAs. I don't have time, and I think that's probably what Tom is going to be talking about in the next talk. Thank you.