Good morning, everyone. I feel a little disoriented: I was here on Monday and had to run back today, so I hope this makes some sense by the end of the presentation. Rather than try to capture the entire landscape of what the mouse ENCODE project was like and how it compared to what we found in the human one, I'd like to focus on a very specific area, because in some ways I think, vis-a-vis the interests that each of you have, it's often less the whole gamut of the whole genome and more specific kinds of questions that call for using some portions of the data. So I'm going to try to talk a little bit about that, and I hope I can convince you that, both from a global point of view and from a fairly narrow, specific point of view, the data provided by ENCODE is of some value.

So let me begin with a summary graph which we talked about in a paper about two or three years ago. It was the summary paper for the phase two ENCODE project in the transcriptome studies. I love this figure. It took me a long time to put it together, but I really love it because it triggered a whole variety of questions which I hope we can address in the following slides. This figure comes from looking at the landscape of expression of the human transcriptome, admittedly in a very narrow range of biological circumstances, namely 15 cell types, all of which have been in culture for a long time. Even the ones that are primary have been around for a while. And we looked not only at this narrow range but also, in a few of those cell types, at what was going on in the nucleus, what was going on in chromatin, and what was going on in the cytosol in terms of what the RNA profile looked like there.
And in these 15 cell lines and the different compartments, we saw about 70% of all of the annotated gene regions from GENCODE expressed. As for how they played out: we divided, arbitrarily, the world of transcription into the annotated protein-coding genes, the annotated non-protein-coding genes, and all of the new RNA transcripts, which admittedly are less defined in terms of their boundaries because they're really models, made from Cufflinks estimates of what the transcripts are and where their splice sites are. And this is what we saw when we plotted the log expression in the nucleus versus the cytosol, and the levels of expression, for each of those categories of genes. Each point within these clouds is a given gene in each of the 15 different cell lines. You can see that there is a natural and very clear centroid for the protein-coding genes, which is more localized in the cytosol. For the annotated non-coding genes, the centroid moves more towards the nucleus, and that cloud is clearly shifted to lower levels of expression. And finally, the novel things, which again are admittedly not as well defined as the other two categories, are clearly a much more nuclear-oriented collection of genes, with very much lower expression than the centroid seen for the protein-coding regions.
Now, there's about six logs of separation over that range of expressed genes. One of the things that intrigued us about this very large difference in expression level was this: could any given gene fall anywhere along that dynamic range of six logs, or was there a more defined set of genes whose expression could be varied only within a much narrower range? That's really the subject of what I want to talk about here, because when we went to the mouse studies we revisited this question.

So let me give a quick background for those of you who are unfamiliar. The mouse and human studies differed considerably, and at first we thought this was a very significant failing in the experiment, because basically we had 18 cell lines from the human study and 25 tissues from five different developmental stages for the mouse. So we had tissues and cell lines, and making comparisons and finding out whether any differences were actually attributable to the two different species would obviously be confounded by this very fundamental change. On the other hand, we thought that if this dynamic range of expression and the variability of that expression was maintained despite the variation in the biology that we were comparing, we could then ask whether that feature was actually a fundamental feature of things greater than just mouse and human.

In addition, every sample had at least two biological replicates. Very importantly, we can say at the level of reproducibility, using IDR, that these were highly reproducible results in every sample, and if we were talking about a splice site or an initiation site for transcription, there had to be a minimum of five reads for that particular sample. We used poly(A)+ RNA and we
sequenced paired-end 100s with about 400 million reads per replicate, so this is very deep sequencing with fairly extensive coverage. In fact, an annotated gene was not called as present unless 90% coverage of that gene was observed. So it wasn't enough to see patches of anything; you actually had to see full coverage of that particular gene.

Now, I'm going to use the word conservation in a couple of seconds, and I want to make sure you understand that I'm not using it in a strict genetic sense. I'm not arguing that what we found was maintained by purifying selection. I'm using conservation in a generic way, that is to say, these features are maintained by mechanisms which may be other than genetic selection. The two points to remember as we go forward are, first, that the differences in sample type and species underscore the significance of any similarities, because there's such a wide diversity in the biology of what we're studying; and second, that the conserved features highlighted are not dependent on common sequences. In other words, if we see this feature, the genes do not have to share a conserved set of sequences around them. So please remember those two points.

Now, what we saw in the first slide can be exemplified by the fact that, if we look at any given cell type, we can see either something that's relatively abundant and, let's say, cytosolically localized, or something very rare, less than one copy per cell. That often triggers a sense of caution on the part of people who study this, like myself, and we wonder whether we should actually worry about these more rarely expressed genes. As many of you are already familiar, we have examples such as jazz 1F. In this case, this is a figure I asked for from Arun Raj, and he generously sent it to me. Each point for jazz 1F is a single transcript decorated with at least 50 fluorescent probes, so
we're seeing here roughly the number of transcripts for jazz 1F, which you can clearly see is predominantly cytosolic and averages on the order of about 10 to 15 copies per cell. In the case of HoxD10, in these cell lines this gene has an antisense transcript which is expressed at much less than 10 copies per cell. What you can see here is that in this field virtually no cell has any indication of it except for one, and again it's a highly nuclear-localized transcript. So in the context of what we're about to say, whether we're talking about high or low levels of expression, it is important to realize that we may be talking about different programs that are ongoing at these different levels of expression. Namely, in some cells a transcript could be quite highly expressed even though the overall number is quite low.

Now, when we finished the mouse project, we could see the relatively undeveloped nature of the annotations for the mouse genome. Initially there were about 90,000 annotated transcripts for the mouse in GENCODE, and we found about another 200,000 novel transcripts in the mouse, again most of these created by modeling but then supported by information such as a five-prime start site and a polyadenylation site for some of these transcripts. Many of these have now been incorporated into the GENCODE annotation set, and that brought the total up to about 290,000 transcripts. In human there were 164,000, and we found about 151,000, so we're starting to approach parity in the number of transcripts in mouse and in human. Now, strikingly, one of the things that increased dramatically was the non-coding transcripts. For the mouse there were about 3,900 non-coding genes, and the human had 10,000 non-coding genes. That number now is about 6,000 non-coding genes in mouse and about 12,000 in human, so there's still a lack
of parity in this class of RNAs, which we think will eventually resolve with additional analysis of different sample types.

Now, this is all leading up to the story that I'd like to go over with you. When we looked at these data, we did it at a 60,000-foot level: we correlated expression across the mouse and human genomes, just a gross correlation comparing the average read density with the average level of expression. If you do that irrespective of coding, non-coding, or novel genes, across a lot of genome, then for regions that you could align because there are orthologs between the two species, you had a fairly sizable correlation of about 0.6 to 0.7, even though there's virtually no filtering going on. So across the two entire genomes, the gross level of expression of RNA was relatively highly correlated in orthologous genes. By comparison, if you take alignable but non-homologous, non-orthologous gene regions in mouse and human, you get a statistically significant but not very impressive correlation of about 0.4. And at that gross level, the dynamic range of expression gave us a similar result in both human and mouse, about six orders of magnitude, as we saw about three or four years ago in the human analysis.

Now, to give us some context about the orthologs before we delve into that: we compared about 20,000 expressed genes in mouse and 18,000 expressed genes in human, all of them expressed protein-coding genes. The overlap, genes that are orthologous and expressed in both mouse and human, is about 15,000 genes. Of those 15,000, if you extend the analysis to about six different species rather than just human
you have about 6,000 of those 15,000 genes that are expressed in six different species and are orthologous among the six.

Now, if you take the log of the expression, in terms of its mean, its maximum, or its minimum, and you compare that to the dynamic range of that gene over all the different cell types and tissues that we have, there was a class of genes whose expression was very correlated between mouse and human, as you can see across this diagonal. But there was also this collection of genes whose log expression and maximum expression were much more variable; in other words, their dynamic range of expression was much more variable in the comparison of mouse and human. You can see this because, if you look at the number of genes in mouse and human at each maximal level of expression, the two species have a very similar distribution. But if you look at the number of genes expressed in both mouse and human according to their range of expression, you see the beginnings of two populations, as we saw with that scatter plot: one with a large number of genes that had a relatively large amount of variation, over here, and one that had much less variation. If you look at a two-dimensional plot of this, what you see is that the number of genes in the two populations is relatively similar, although somewhat more genes have the greater variation than have the constrained variation. And what you see now is that the two gene populations between human and mouse basically turn into a bimodal distribution, with one set of genes whose expression varies by about two logs and very rarely extends past two logs of variation. In other words, there is a class of genes which, irrespective
of what tissue you have or what species you have, do not vary their expression by more than two logs, and this was quite striking to us. The genes with constrained variation represented a sizable portion of the total mass of the RNA being produced in all the cell types that we were looking at. So this constrained group of genes is making a small amount of RNA that you see consistently in cells, but 40% of the mass of all this RNA is made within a very narrow range of variation, irrespective of the species, cell type, or tissue from which it was obtained.

So go back to the slide that we were talking about before. We indicated that roughly 15,000 genes that were orthologs between mouse and human were expressed, and that of those 15,000, about 6,000 were also expressed as orthologs in six different species. But of those, 6,000 are constrained in human and mouse, and about 2,500 are constrained within all six species. So again, about 30% to 40% of this population of orthologous genes are not only expressed but constrained in their expression between the species.

So in summary, what we see is that about 70-80% of mouse and human orthologs are expressed when comparing cell lines versus tissues; remember, these are very different biological conditions. 40% of the orthologous genes expressed in mouse and human are expressed in at least four other species. 44% of the orthologs expressed in mouse and human have constrained expression, that is to say less than two logs of variation in any tissue. About 17% of the ortholog genes expressed in mouse and human are constrained in their expression, and finally about 40% of the expressed mouse and human ortholog genes that are constrained in their expression are constrained in any other species. So this immediately triggered a sense in our minds that maybe these genes are
critical to the survivability and functioning of cells. When we went back and looked at the overall correlation of expression between mouse and human, we could see that this constrained set was the reason why you see this correlation between mouse and human in the global sense. And when you see that, you wonder: is this collection of constrained genes the set of genes that are part of the housekeeping mechanisms for these cells, irrespective of what cell type it is or what its membership is? That triggered the question of what a housekeeping gene is. We use this term almost invariably in every sort of study that we do, and we attribute a lot of biological meaning to it or use it as an exclusion class for many different types of studies. But if you look at the literature and ask whether there is some commonality in what is actually called a housekeeping gene, then, taking a range of studies (I only put a few up here, from last year back to about 10 years ago), there's very little overlap in what people call housekeeping genes in those studies.

So as a proposal, I would argue that one of the things that came out of this comparison of mouse and human was a principled definition of what a housekeeping gene is: a gene whose variation in levels of expression is constrained irrespective of its biological origin in terms of species or cell type. Having this principled definition would give us a mechanism by which we could actually make these comparisons of whether things are constrained, housekeeping, or non-housekeeping. Remember, there is this other class of genes that are highly variable and are likely to be cell-type specific or tissue-type specific. And so I would go back to the earlier presentation, which basically asked whether mutations in those regions of tissue-type-specific genes are the principal causative element, or how much of this is actually shared among
these genes that have these constrained levels of expression and are now forced out of that constraint.

Now, one last set of observations from these studies. If you ask whether these constrained genes share, for example, sequence conservation in the regulatory regions, within sites that could be used to control the levels of expression, like sites around splicing, what you see is that there is very little sequence conservation in the regulatory regions of these constrained genes or in the other regions that could be responsible for their levels of expression. The other interesting thing is that when you go back and look at the data collected by the other groups in the mouse ENCODE project, the constrained genes have patterns of histone modification which are quite different from those present in the unconstrained genes. Now, you might say this has to be true, because the unconstrained class generally has a lower level of expression than the constrained group, so you would likely see more active marks. But it doesn't explain everything, because there are many genes with a relatively low level of expression that still have prominent marks compared to their orthologs in the unconstrained class. So the mouse ENCODE epigenetic marks point in the direction that these two classes of genes are under different kinds of regulation: that there is a kind of global regulation that allows this constraint to be maintained, compared to the genes in the unconstrained class.

Finally, other questions remain to be resolved. What is the mechanism responsible for this maintenance and inheritance of constraint in the variation? And does this property extend to the non-coding transcripts? I can tell you that we've looked at this now, at least in terms of the annotated ones, and the answer is yes: in the non-coding class
of RNAs there is clearly a subdivision into constrained versus unconstrained. I'd like to end on that topic and welcome any of your questions. Thanks for your attention.

Thank you. Where do you stand on the issue of transcriptional noise for the non-coding group?

So, you know, this is going to turn into a philosophical discussion. Noise, in the sense that it's often been used, is a pejorative term, meaning an RNA whose role is either unknown or not important to the cell, just a byproduct of activities that are more stochastic than regulated in some kind of developmental or expression program. The problem I've always had with this is that it may well be true, but I think you have to ask yourself what happens in the circumstance where you have transcription occurring in a given region and, because you're a God, you know that there's no actual biological function for the product, but the act of transcription is there. Is that a functional molecule, because you have to have this transcriptional event even though its product is not directly important? My personal bias is that there is highly likely this kind of random and variable expression going on in the cell, which for all the randomness is not actually predictable and under control. But I don't have an opinion as to whether that is something essential for the cell, because its mere presence, its mere creation, may in fact be an important attribute of the cell.

Yes, just one comment. It's so difficult. I don't know if you know the answer to the question of how many non-coding RNAs, especially the ones around 200 base pairs or so, have actually been validated functionally, and of the ones that have, how many have only been validated by knocking them down and not by impeding their transcription per se. I don't know what that number is, but it seems to be very small, right?
I'll just give you one number. The number of non-spliced transcripts that we annotated in phase 2 was twice the number of non-coding spliced transcripts. So it's something on the order of about 250,000 non-coding, non-spliced transcripts, which never really made it into any database because of the concern that these could in fact be background DNA or incompletely spliced things.

Hello, that was really nice. If my understanding is correct, most of the human cell lines come from different people, so they have some genetic diversity in them, but most of the mouse tissues come from C57BL/6 males, 8 weeks old; not all, but most of them. What effect do you think that narrow selection of murine genetic diversity has on your selection of the genes that correlate between humans and mice?

So we've now repeated this analysis with human tissue and compared it to the mouse tissue, and we still see this very clear collection of many of the same genes that we had seen here, a constrained population versus not. In the human samples there were also males and females present, and they were tissues as opposed to transformed or immortalized cell lines.

I guess I'm mostly asking about the selection of the mouse tissues, though. It's understandable: if you're doing a whole bunch of tissues you have to be narrowly focused, so they pick one strain. But if you start looking in other inbred strains, do you expect that set would get larger, smaller, I don't know, or otherwise change?
Well, what we see when we go to other species is the same general principle. These other species had a smaller number of genes which were orthologs and could be compared. So the degree to which we've extended it past the mouse would argue that if we went to other strains we would still see the same principle of a group of genes being highly constrained in their variation, but nevertheless maybe a smaller number.

Very nice talk. I've been waiting the last two days to ask you these technical questions, and we can follow up during the coffee break if this is not the right place. First, you are showing that this is a nuclear transcript and this is a cytoplasmic transcript. In my case I have to look at nuclear transcripts, cytosolic transcripts, and vesicular transcripts. What strategies do you suggest for proper normalization? I face a real problem doing that. That is my first question.

So in the case of normalization we used a couple of different approaches. One was to normalize using the spike-ins that we put into each of the experiments. The other was to extrapolate based on the total amount of RNA that was actually isolated from each of the compartments. Those two methods were used first to determine whether we got similar kinds of results, but also to take into account that there is obviously a lot less RNA in the nucleus than in the cytosol.

My second question: what is the advantage of paired-end reads now that we can have very long single-end reads covering most of the insert size?

Yeah, well, are you talking about PacBio reads, or?
Yeah, maybe Life Technologies Proton, or even Illumina, yeah.

So we have a lot of experience with that. Number one, you don't have enough money in your grant to do this experiment using PacBio; it's a very expensive technology. Number two, you could argue that with enrichment and selection methods you could go after things which are not well characterized, or for which you want better resolution. But in our hands, and I think in the hands of many, many labs that have been doing this, the enrichment approach has been relatively inefficient. Roughly 20% of everything that you target winds up being capable of some kind of analysis. The other 80% is off-target things that show up in your experiment. It isn't that those off-targets are not valuable, but the bulk of what you're spending time and effort on are not the things you want to learn more about. So I think long reads need a significant improvement in both cost and technology development before they can really challenge short reads.

My question was, if you can just do 200 or 250 base pair reads instead of doing paired-end 100 bases, do you see any benefit to paired-end?

Compared to the 200 base or longer reads, the advantage of paired-end reads is that you can cover larger distances in the transcriptome. The average insert size can vary up to 500 or 600 nucleotides, which is what's in that middle piece, and by inference you can determine things that are not in the sequence itself but lie in between the two paired ends, which you can't do from a single 200 or 400 nucleotide read.

Thank you. And the last one is very simple, I think. Are you reaching any sort of saturation with 400 million reads, or what do you think the state is at one transcript per cell?
So the answer is that most of the non-coding RNAs are still relatively undeveloped, and this comes from the long reads and from RACE experiments. So we still have, I think, an appreciable learning curve for the non-coding RNAs. And even for the annotated coding regions, the number has jumped from about 5 transcripts per coding region to about 9 or 10, based on the increased depth and resolution that we're capable of achieving now.

Thank you.

Alright, we're going to break now. Please meet back here at 10.
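
For readers who want to try the constrained versus unconstrained split described in the talk (a gene counts as constrained when its expression varies by less than two logs across all samples, irrespective of species or tissue), here is a minimal Python sketch. It is an illustration under stated assumptions, not the analysis used in the project: the function names, the toy expression values, and the choice to pool human and mouse samples before taking the range are all hypothetical, and the values are assumed to be positive expression estimates in a common unit.

```python
import math

def dynamic_range_log10(values):
    """Return log10(max) - log10(min) for a list of positive expression values."""
    vals = list(values)
    if not vals or min(vals) <= 0:
        raise ValueError("expression values must be non-empty and positive")
    return math.log10(max(vals)) - math.log10(min(vals))

def classify_orthologs(human, mouse, max_logs=2.0):
    """Label ortholog pairs by their pooled dynamic range.

    human, mouse: dicts mapping an ortholog identifier to a list of
    expression values, one per sample (hypothetical input layout).
    Only genes present in both dicts are classified.
    """
    labels = {}
    for gene in human.keys() & mouse.keys():
        # Assumption: pool samples from both species, then take one range.
        pooled = list(human[gene]) + list(mouse[gene])
        rng = dynamic_range_log10(pooled)
        labels[gene] = "constrained" if rng < max_logs else "unconstrained"
    return labels

# Toy values, invented for illustration only.
human = {"geneA": [10.0, 20.0, 50.0], "geneB": [0.1, 5.0, 900.0]}
mouse = {"geneA": [15.0, 30.0, 80.0], "geneB": [0.5, 300.0], "geneC": [1.0]}
labels = classify_orthologs(human, mouse)
```

In this toy input, geneA spans less than one log and geneB spans almost four, so they fall on opposite sides of the two-log threshold; geneC has no human counterpart and is skipped. Real data would also need the expression filters mentioned in the talk (replicates, minimum reads, gene coverage) applied first.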