Hello, and welcome to my talk. My name is Alessandra Breschi and I led the transcriptomics analysis of the ENCODE3 phase. I'm going to present the results of a collaboration between the teams of my PhD advisor, Roderic Guigó, at the CRG, and Professor Thomas Gingeras at Cold Spring Harbor. I'll summarize our main findings here, but I invite you to read our recent publication in the July issue of Genome Research.

Although we don't have exact numbers, it has been estimated that there are hundreds of different cell types in the human body, which are beautifully organized to make functional organs and systems. For comparison, when we look at the actual number of individual cells, it has been estimated to be over a trillion in the human body. However, all these different cells in a single person share roughly the same genome. So we want to understand what differences in genome regulation create this diversity of cellular functions and phenotypes.

As part of the ENCODE project, whose main objective is to build an encyclopedia of DNA elements, meaning classifying all sequences in the human genome and annotating their function, we pursued two main objectives. The first was to characterize what makes each cell type different at the molecular level. The second was to use the cell-type-specific molecular signatures to understand how cell types are distributed to form complex organs.

For this, we performed an integrated analysis of thousands of data sets from ENCODE, from other consortia such as GTEx, and from other published studies. In particular, we analyzed data from over a hundred human primary cells, meaning cells extracted directly from tissues with a very low passage number. We looked at gene expression with RNA-seq, at chromatin accessibility with DNase-seq, and at promoter data with RAMPAGE and CAGE. Then we wanted to relate these analyses to gene expression data from complex organs, both healthy tissues from the GTEx project and cancer tissues from the PCAWG project.
In our analysis, we also used thousands of single-cell profiles, which have a higher resolution than bulk RNA sequencing in terms of cellular specificity, and which were very useful to confirm and generalize our findings. Unfortunately, I won't have time to talk about the single-cell analysis here, but I invite you to read our paper if you are interested in it.

The first thing that we did was to cluster our primary cells by gene expression. As you can see on this slide, we have cells extracted from multiple sites of the body. So we were interested to see whether the anatomic site had a stronger or weaker effect than the cell type on the clustering. It turned out that we could observe a pretty strong clustering by cell type. In particular, we could identify four main cell types: mesenchymal cells, which included fibroblasts, smooth muscle cells, and mesenchymal or adult stem cells; melanocytes; endothelial cells; and epithelial cells.

So if you look at endothelial cells, for example, we had endothelial cells from blood vessels, from lymphatic vessels, from the heart, from the uterus, lungs, skin, and so on. And they all clustered together, irrespective of their tissue of origin. On the other hand, if you look at an organ like the lung, for which we had epithelial, endothelial, and mesenchymal cells, we still see stronger clustering by cell type than by organ of origin.

Next, we wanted to confirm that this clustering is still observed at different layers of genome regulation. Here I'm showing t-SNE embeddings of about 150 primary cell samples, each profiled with different combinations of RNA sequencing, CAGE, and DNase-seq. It is not straightforward to integrate these data modalities, especially chromatin accessibility and gene expression. So we find it quite remarkable that cell type differences are stronger even than differences by assay. And again, we see epithelial, mesenchymal, and endothelial cell types like before, but here we also see blood and neural cells that form separate clusters.
We couldn't show that before with RNA sequencing alone because we didn't have RNA sequencing data for those cells.

If you search on Google or read a histology book during your medical or biology training, you'll see that the main classification of tissues is into the traditional four types: connective, epithelial, muscle, and nervous tissue. This is, of course, a very valid classification. What we propose is an additional layer of classification that is based on the transcriptome of the cells. In the context of the classification that we propose, we observed that endothelial cells, although histologically they are a subtype of epithelial cells, have a very distinct transcriptional profile. The same is true for blood cells, which are very different transcriptionally from the other mesenchymal cells. And we observe more similarities between other connective tissue cells and muscle cells, which we collectively refer to as mesenchymal.

We performed differential gene expression analysis on the cells for which we had RNA sequencing data and identified about 3,000 cell-type-specific genes. You can see the breakdown in this heat map, with the number of genes in parentheses. And you can see that these genes are enriched for gene ontology terms that are descriptive of the related cell type. For example, endothelial-specific genes are enriched for blood vessel development and epithelial-specific genes are enriched for epithelial differentiation.

Here I'm showing an example of an endothelial-specific gene. In particular, it's a long non-coding RNA of yet unknown function. In contrast, you can see how the flanking genes are ubiquitously expressed in all the cells. And again, to reiterate what we observed in the initial clustering, this gene is expressed in all endothelial cells regardless of their tissue of origin.

Among the cell-type-specific genes, we identified 56 transcription factors, which form highly correlated clusters depending, again, on the cell type.
We used the transcription factors for which there was a known DNA motif, which are shown in squares, to filter regions of open chromatin defined by DNase-seq. This slide is a bit complicated, but basically we looked at whether cell-type-specific transcription factors have more predicted binding at the promoters of cell-type-specific genes. So for example, if you look at ERG, which is an endothelial-specific transcription factor, we find accessible ERG motifs more often in endothelial-specific genes than in other genes in endothelial cells. In this slide, I'm showing only endothelial-specific transcription factors. We have a similar result for epithelial genes and mesenchymal genes, but I won't show it here in the interest of time.

Then we also looked at the impact of splicing on cellular variation. In this slide, we show the contribution of gene expression versus isoform usage to the changes in isoform abundances across cell types. We observed that most of the variation is due to changes in gene expression. However, we could still find more than 200 alternative splicing events that are cell-type-specific. This is one example, where we show preferential inclusion of exon 6 of MYL6, which is a myosin gene, in mesenchymal cells compared to the other cell types. Exon 6 is important because it overlaps the EF-hand motif of MYL6.

We could also find a few cases of cell-type-specific promoter usage. Here, I'm showing the example of a calcium-binding protein, S100A16, where a proximal promoter is mostly used in endothelial cells and in melanocytes, while there is a distal promoter which is used in the other cell types.

I'd like to emphasize that all this data is available as a resource on the ENCODE portal and in the supplementary tables of our paper. So I think we have shown that we could find a lot of molecular signatures that distinguish these cell types. Now we want to see how these different cell types are combined to form complex organs.
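The motif comparison described above (accessible ERG motifs in endothelial-specific genes versus other genes) can be framed as a 2x2 contingency test. The following is only a minimal sketch of that idea using a one-sided Fisher's exact test; the talk does not say which statistical test was actually used, and the function name and the counts are hypothetical, chosen for illustration.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    a: cell-type-specific genes with an accessible motif
    b: cell-type-specific genes without one
    c: other genes with an accessible motif
    d: other genes without one
    Returns P(X >= a) under the hypergeometric null (no enrichment).
    """
    specific = a + b           # number of cell-type-specific genes
    with_motif = a + c         # number of genes with an accessible motif
    total = a + b + c + d
    denom = comb(total, with_motif)
    upper = min(specific, with_motif)
    tail = sum(
        comb(specific, k) * comb(total - specific, with_motif - k)
        for k in range(a, upper + 1)
    )
    return tail / denom


# Hypothetical counts, NOT taken from the talk or the paper.
p_value = fisher_exact_greater(40, 60, 100, 800)
print(f"one-sided p-value: {p_value:.3g}")
```

A small p-value here would mean that the motif is found in the specific genes more often than expected by chance, which is the kind of enrichment the slide describes.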
For this, we used xCell, which computes an enrichment score for a given set of cell types in a given sample. Instead of using the default set of signature genes, we used the set of cell-type-specific genes that we identified previously. We ran xCell on about 8,000 samples from GTEx and we obtained enrichment scores for each cell type. Here you can see a summary of the results. The main observation that we can make at first glance is that the enrichment scores vary across different tissue sites. More to that point, if we plot the samples in the three dimensions defined by the enrichment scores for mesenchymal, endothelial, and epithelial cells, we can see that the samples form distinct groups, which shows that each organ has its own distinctive cellular composition and that this is reflected at the transcriptional level.

Now I want to show three main examples which highlight the importance of considering different cellular abundances when studying gene expression from complex organs.

The first example is about how cellular composition can help us detect biases in the dissection of the sample. Here I'm showing a slide of a stomach sample where we can clearly see two main histological components: an epithelial layer, which is called the mucosa and is made of epithelial cells, and a muscularis layer, which is mostly made of smooth muscle cells, which are mesenchymal cells in our broader classification. The GTEx consortium offers an amazing collection of histopathological slides from the same tissues from which the RNA was extracted. We found almost 200 stomach slides, which we classified as having only the mucosa layer, only the muscularis layer, or both. The initial classification was manual, and then we trained a support vector machine classifier, which had over 80% accuracy and which we were able to use on colon samples as well, although I won't have time to talk about that here. We then selected only the samples which we could classify as mucosa only or muscularis only.
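To make the idea of a signature-based enrichment score more concrete, here is a deliberately simplified sketch. It is not xCell and does not reproduce its algorithm; it only scores a sample by the average expression rank of a set of signature genes. The function name, the toy gene sets, and the expression values are all assumptions made for illustration, not the gene lists from the talk.

```python
def signature_score(expression, signature):
    """Mean rank percentile (0 to 1) of signature genes within one sample.

    expression: dict mapping gene name -> expression value for one sample
    signature: iterable of gene names for one cell type
    Genes missing from the sample are ignored; ties are broken arbitrarily.
    """
    genes = sorted(expression, key=expression.get)  # lowest expression first
    if len(genes) < 2:
        raise ValueError("need at least two genes to compute ranks")
    percentile = {g: i / (len(genes) - 1) for i, g in enumerate(genes)}
    present = [g for g in signature if g in percentile]
    if not present:
        raise ValueError("no signature gene found in the sample")
    return sum(percentile[g] for g in present) / len(present)


# Hypothetical toy sample and toy signatures, for illustration only.
sample = {
    "PECAM1": 50.0, "VWF": 40.0,   # commonly used endothelial markers
    "KRT8": 1.0, "KRT18": 2.0,     # commonly used epithelial markers
    "COL1A1": 10.0, "ACTB": 100.0,
}
signatures = {
    "endothelial": ["PECAM1", "VWF"],
    "epithelial": ["KRT8", "KRT18"],
}
scores = {name: signature_score(sample, genes) for name, genes in signatures.items()}
print(scores)
```

In this toy sample the endothelial signature scores higher than the epithelial one, which is the kind of per-sample contrast that the enrichment scores summarize across tissues.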
We then looked at the expression of the most variable cell-type-specific genes. We can see that epithelial-specific genes are overexpressed in samples with only mucosa, while they're virtually absent in samples with only muscularis. The same is true for mesenchymal-specific genes, which are only expressed in samples with only the muscularis layer. So it's clear that we can discriminate between sets of stomach samples where only one or the other layer was dissected for RNA sequencing.

The second example I want to show is about using cellular composition to characterize some histopathological states. Here I'm showing enrichment values for breast samples in males and females, and some exemplary slides from the GTEx portal again. It's pretty evident that there is a difference between males and females, especially at the level of epithelial cells, and this is expected considering all the ductal structures that are present in the female breast. However, there is a condition in males called gynecomastia, which is described as an enlargement or swelling of breast tissue in males. You can see how in these individuals the cellular composition is more similar to that of females, especially if you look at the epithelial cells.

The last example I want to show is related to cancer. This is especially relevant in clinical settings, where single-cell RNA-seq is very expensive and RNA is usually sequenced for the entire tumor sample. Here we are comparing cellular enrichment for endothelial cells across different tumor samples from the PCAWG project. Kidney cancer is one of the few cancers for which there are also normal samples sequenced as part of the same project. This is very important to verify that there is little batch effect when we compare enrichment across data sets, and that we can indeed apply this method to different projects.
We can observe that all normal samples, in green, have a similar distribution of endothelial enrichment scores, while the orange samples, which are the primary cancers, have higher scores, which we think could be related to increased vascularity in the cancer. The only exception is the samples from KIRP-US, which is a cohort of renal papillary cell carcinomas, which are known to have reduced vascularity compared to the other kidney cancers.

So in conclusion, we identified five major cell types with distinct transcriptional and regulatory programs, namely epithelial, endothelial, mesenchymal, blood, and neural. We can characterize the major cell types across different molecular assays and modalities. We can infer the characteristic cellular composition of entire complex organs from RNA-seq. The cellular compositions are altered in some pathological states, including cancer, and it is really important to take that into account when studying gene expression from entire organs.

Finally, I'd like to thank all the people who worked with me on this project, in particular Manuel Muñoz Aguirre and Valentin Wucher from the Guigó lab, who are co-first authors with me on this paper. I'd like to thank Tom Gingeras and his lab; it was great collaborating with them. And I'd also like to thank Mike Snyder, who hosted me during my postdoc at Stanford while I was finishing up this paper. Thank you for your attention, and I'd be happy to take any questions.